mattodd / blog


Hey Chemists! Planetary Forge Twitter Bot #2

Open mattodd opened 3 years ago

mattodd commented 3 years ago

I'm asking for advice, and potential solutions, for a twitter bot that is able to tweet out pictures of molecules to catch the eyes of chemists who may own related structures or know how to make them.

This is intended to contribute to solving the matchmaking problem of open source drug discovery: connecting projects that need molecules with those people (individuals or companies) who may be able to provide them.

Context

I gave a talk recently where I was describing how easy it is to miss relevant scientific work that is taking place elsewhere in the world (slide below). I was telling the story of how I had been scrolling through my Twitter feed and I'd screeched to a halt when I noticed a picture from one of the SGC's open lab notebooks. The structures were remarkably similar to structures we were looking at in OSM Series 3. One thing led to another and we and the SGC co-evaluated each other's molecules, but the point of the story is that I could so easily have missed it. I suspect I'm missing things all the time.

What we need is the magic bot that automatically connects people working in the open on similar molecules, SCINDR, that we've proposed (and just need 50K to implement). But we don't have that yet. Nor yet do we have a good Molecular Craigslist solution.

As I said in the talk "Finding people working on relevant things shouldn't rely on scrolling through pictures on Twitter". But then I added "although actually a bot that tweets out pictures of molecules isn't a terrible idea."

The need

We humans respond well to pictures, and we chemists respond particularly well to pictures of chemical structures. We understand them quickly and can place them in context well: Do I have that structure? Have I read about them recently or do I know someone looking at these? How would I make that?

In open source drug discovery projects we have openly available sheets of molecules that accumulate as the projects progress. These sheets contain SMILES/InChI strings. Here are examples from malaria, TB and mycetoma.

What we need is a bot that takes a random entry from one of these sheets, renders a nice chemical structure, and tweets out the picture of the molecule along with a link to where the interested person might find more info (e.g. the relevant project's landing page) as well as a chemical string (to make the tweet useful to other bots). So the idea is to attract the attention of human chemists, and make it easy for those interested people to connect to the project to see if they can help further.

This could be done with molecules that have been made already, or molecules that are needed next in a project.

Here's a mocked-up tweet, and I used the hashtag #planetaryforge to try to get across the underlying vision. It's a little like the Molecular Craigslist concept, but made specifically to catch the attention of human chemists.

The request

Anyone want to try this out by making a prototype bot? We can call it The Germinator.

greglandrum commented 3 years ago

I'd be happy to talk about it. Looks like it might not be that much work to put something together if we can agree on what it should do. :-)

miike commented 3 years ago

> I'd be happy to talk about it. Looks like it might not be that much work to put something together if we can agree on what it should do. :-)

Happy to help out here as well!

A few questions:

mattodd commented 3 years ago

Hi @greglandrum @miike Belated thanks for questions and enthusiasm.

1) Bot knows to look once each day at a certain spreadsheet(s). It picks a random row.
2) It takes a string and renders the structure as a picture for humans.
3) It formulates a post that includes the pic, the string and some pre-arranged hashtag(s).
4) It includes a URL where the molecule is featured or required. That URL is already in the spreadsheet, so it's just copying and pasting, not thinking.
5) It spits out a tweet or equivalent, so that the output looks roughly like the above example.
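A minimal sketch of steps 1, 3 and 4 in Python, to make the workflow concrete. The column names (`smiles`, `hashtag`, `url`), the example rows and the `example.org` links are assumptions for illustration, not the layout of any real project sheet; rendering the picture (step 2) and posting (step 5) are left out.

```python
import csv
import io
import random

FIXED_HASHTAG = "#planetaryforge"  # hashtag suggested in this thread


def pick_random_row(rows, rng=random):
    """Step 1: choose one spreadsheet row at random."""
    return rng.choice(rows)


def compose_post(row, fixed_hashtag=FIXED_HASHTAG):
    """Steps 3-4: combine the chemical string, hashtags and the URL from the sheet."""
    parts = [row["smiles"]]
    # Per-row hashtag first (if any), then the hashtag added to every post.
    parts.append(" ".join(t for t in (row.get("hashtag", ""), fixed_hashtag) if t))
    if row.get("url"):
        parts.append(row["url"])  # copied straight from the sheet, no lookup needed
    return "\n".join(parts)


# Hypothetical sheet export; a real bot would fetch the sheet once a day.
SHEET_CSV = """smiles,hashtag,url
CCO,#solvent,https://example.org/ethanol
c1ccccc1,,https://example.org/benzene
"""

rows = list(csv.DictReader(io.StringIO(SHEET_CSV)))
print(compose_post(pick_random_row(rows)))
```

Passing the random generator in as `rng` is only there so the choice can be made reproducible when testing.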

Publishing across multiple platforms - very good, though the "thoughtlessness" of it fits better with Twitter than e.g. LinkedIn. Sanity check: ideally not needed.

What do you think? Anything in that workflow missing?

miike commented 3 years ago
  1. That should be easy enough - do we need a mechanism for preventing it picking the same row again (or should it have a cool-off period before it can pick that row again?)
  2. That should be fine
  3. I'm guessing the hashtags can be defined in the sheet?
  4. That should also be easy enough

I don't think we'll run into any character limits (280) though we may get close if we're posting InChI + SMILES.
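A rough way to check the character budget, sketched in Python. It assumes that every link counts as a fixed 23 characters (Twitter shortens links) and that all other characters count as one, which is a simplification of the real counting rules. The order in which fields are dropped is also just one possible choice.

```python
import re

TWEET_LIMIT = 280
URL_WEIGHT = 23  # assumption: each link counts as a fixed-length shortened URL
URL_PATTERN = re.compile(r"https?://\S+")


def estimated_length(text):
    """Approximate tweet length: links count URL_WEIGHT, other characters count 1."""
    urls = URL_PATTERN.findall(text)
    without_urls = URL_PATTERN.sub("", text)
    return len(without_urls) + URL_WEIGHT * len(urls)


def build_text(smiles, inchikey, hashtags, url):
    """Drop the least essential fields until the text fits the limit."""
    candidates = [
        [smiles, inchikey, hashtags, url],  # preferred: everything
        [smiles, hashtags, url],            # drop the InChIKey first
        [hashtags, url],                    # very long SMILES: rely on the link
    ]
    for fields in candidates:
        text = "\n".join(f for f in fields if f)
        if estimated_length(text) <= TWEET_LIMIT:
            return text
    raise ValueError("post does not fit even without chemical strings")
```

A full InChI is usually much longer than an InChIKey, so a real bot would probably need the same fallback for that field too.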

Publishing across multiple platforms can probably be done with a social media management tool - which would make having things like queues / scheduling a little bit easier.

image

mattodd commented 3 years ago

Nice @miike ! I think we needn't worry about duplicates, i.e. the bot picking the same molecule. Twitter's a waterfall. Yes, hashtags could be in the sheet I guess. Would need to think how. Let's say someone comes along and says "I have a sheet of 100 molecules we need for my open science project, could you please have the Planetary Forge Twitterbot look at it every now and again" then we could ask that person to provide a hashtag in a certain place, or a certain column. I wonder whether on a per-sheet or per-row basis.
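The per-sheet versus per-row question does not have to be either/or. A small sketch, assuming a hypothetical `hashtag` column and a sheet-level default supplied by whoever registers the sheet: a row value wins when present, otherwise the sheet default is used, and the bot's own hashtag is always appended.

```python
def resolve_hashtags(row, sheet_default="", always=("#planetaryforge",)):
    """Return the hashtags for one post, preferring a per-row value over the sheet default."""
    chosen = (row.get("hashtag") or "").split() or sheet_default.split()
    tags = []
    for tag in [*chosen, *always]:
        tag = tag if tag.startswith("#") else "#" + tag
        # Skip duplicates regardless of capitalisation.
        if tag.lower() not in (t.lower() for t in tags):
            tags.append(tag)
    return tags
```

With this shape, a project owner could give one hashtag for the whole sheet and only fill in the column for rows that need something more specific.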

mattodd commented 3 years ago

Real example, to test this out. Let's say we have this molecule

PKIS5 Starting Point

which is OCC@HCOC1=CC2=NC=C3C(NC4=C3C=CC(C#N)=C4)=C2C=C1OC and SEJTVQTXSCTJDG-LBPRGKRZSA-N

The molecule is derived from this page: https://github.com/mattodd/SGC_Sandbox So let's say all this is in one row of some spreadsheet somewhere and there's also a column of that sheet where (in this row we're talking about) there is a hashtag #kinase and the bot auto-adds in (each time) the hashtag #planetaryforge (let's say). Is that enough for the tweet to render?

I'm assuming that the tweets would need to come from a dedicated account we set up for that purpose?

mattodd commented 3 years ago

Hey @alintheopen I bet the above pic from @miike reminds you of that Wanted poster you made for Series 1 and which drew in @PatrickThomson's expertise.

miike commented 3 years ago

@mattodd Yes, we'll need a separate / dedicated Twitter account to render this from.

I've mocked up a proof of concept (without hashtags, as the sheet doesn't have them at the moment) to demonstrate here - https://opensourcemalaria.github.io/render.html

Refreshing the page should generate a new molecule.

mattodd commented 3 years ago

@miike oh my goodness that's awesome. Let me get that Twitter account so we can try it.

mattodd commented 3 years ago

Hi @miike OK so I set up https://twitter.com/PlanetaryForge - what do you need?

miike commented 3 years ago

Want to just flick the password to me over email?


miike commented 3 years ago

@mattodd Alright - we're up and running at https://twitter.com/PlanetaryForge

Things we need to decide:

miike commented 3 years ago

For the moment I've stuck with just the malaria sheet, and a post once per day at 9AM (though this is easy enough to change). There haven't been any build failures so far so that's a good sign.

mattodd commented 3 years ago

Hi @miike - sorry for the delay. I see it's working! Amazing work! Thank you so much!

So at the moment:

1) The tweets don't include URLs for people to find out more. This would, I guess, be a manual thing: in any spreadsheet that is used, there would need to be a column containing a URL that would be included in the tweet, correct? For the OSM sheet, for example, should I just add that column in? I'm wanting people to be taken to a page where they can find out more about why that molecule is of interest.
2) If we wanted to add another sheet to the bot's repertoire, how do we do that?
3) The picture of the molecule in the tweet doesn't scale to the space available - i.e. the molecular structure is cut off. Is there a way to solve this so that the whole molecule appears? I guess this is an issue when looking at the tweets in Tweetdeck or on the webpage (e.g. the below). But actually it works OK on my phone, I just checked. So maybe that's just a client thing?

Screenshot 2021-06-09 at 10 23 36
miike commented 3 years ago

No worries @mattodd

  1. The tweets don't include URLs for people to find out more. This would, I guess, be a manual thing: in any spreadsheet that is used, there would need to be a column containing a URL that would be included in the tweet, correct?

Yes, exactly. There's no URL in the OSM sheet at the moment, so I've omitted it for now. Once we've got the column name we can specify it here: https://github.com/OpenSourceMalaria/planetaryforgebot/blob/main/sheets.py#L14 and it'll pick up the URL going forward and add it to the tweet.

  2. If we wanted to add another sheet to the bot's repertoire, how do we do that?

At the moment there are three sheets https://github.com/OpenSourceMalaria/planetaryforgebot/blob/main/sheets.py and we just add a configuration file for each sheet (sheet id, sheet number and the labels for the columns). We then just need to set up a schedule for each sheet (e.g., 9 AM for all sheets, 9 AM / 10 AM / 11 AM for the three sheets etc). That schedule is also set up in GitHub, and takes effect immediately (this currently uses GitHub Actions which is free for us). Currently there's only one schedule (for OSM) but it only takes a few minutes to add in new sheets / schedules.
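To illustrate what such a per-sheet configuration might hold, here is a hedged Python sketch. It is not the contents of the bot's `sheets.py`: the class, the field names, the column labels and the placeholder sheet id are all hypothetical, chosen only to mirror the three things listed above (sheet id, sheet number, column labels) plus a posting hour.

```python
from dataclasses import dataclass, field


@dataclass
class SheetConfig:
    """Hypothetical per-sheet settings; names are illustrative only."""
    name: str
    sheet_id: str       # identifier of the spreadsheet
    sheet_number: int   # which tab within the spreadsheet
    columns: dict = field(default_factory=dict)  # bot field -> column label in the sheet
    post_hour: int = 9  # hour of the daily post

    def extract(self, row):
        """Map one raw sheet row onto the bot's field names, skipping empty or absent columns."""
        return {key: row[label] for key, label in self.columns.items() if row.get(label)}


SHEETS = [
    SheetConfig(
        "malaria",
        "SHEET_ID_PLACEHOLDER",
        0,
        {"smiles": "SMILES", "inchikey": "InChIKey", "url": "URL"},
    ),
]
```

Because `extract` skips columns that are missing or empty, a sheet without a URL column would still produce a post, just without the link.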

  3. The picture of the molecule in the tweet doesn't scale to the space available - i.e. the molecular structure is cut off. Is there a way to solve this so that the whole molecule appears?

Twitter uses a saliency algorithm ( https://blog.twitter.com/engineering/en_us/topics/insights/2021/sharing-learnings-about-our-image-cropping-algorithm.html ) when it generates the previewed images. Some of the time this is good - if the image contains faces / people / artwork in theory using an algorithm will give a good result according to eye gaze analysis.

In our case I don't think their algorithm performs particularly well on molecules, where the information is deliberately spread out across a larger area. I have no idea what eye gaze analysis would look like on molecules (I suspect it's an interesting problem), but I don't believe we can "force" Twitter to generate a better preview image without reverse engineering the cropping algorithm. That is possible, since they provide some code for the APIs they are using ( https://github.com/twitter-research/image-crop-analysis/blob/main/notebooks/Image%20Annotation%20Dash.ipynb ), but I think it would be tricky, and we'd likely need to manipulate the image to fool the cropping into paying attention to the whole molecule.

On iOS / Android this year they started dropping the saliency algorithm altogether which might explain why it looks better on mobile.

In March, we began testing a new way to display standard aspect ratio photos in full on iOS and Android — meaning without the saliency algorithm crop.
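A simpler thing to try than reverse engineering the crop is to pad the picture to the shape of the preview, so that less needs cropping in the first place. This is a different technique from anything tested in this thread, and it rests on two unverified assumptions: that the web preview is roughly 16:9, and that an image already in that shape is shown more or less whole. The sketch below only does the arithmetic for the padded canvas.

```python
def padded_canvas(width, height, ratio_w=16, ratio_h=9):
    """Return (canvas_w, canvas_h, x_offset, y_offset) for centring an image
    on the smallest canvas with a ratio_w:ratio_h shape."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width * ratio_h < height * ratio_w:
        # Image is narrower than the target shape: widen the canvas.
        canvas_w = -(-height * ratio_w // ratio_h)  # ceiling division
        canvas_h = height
    else:
        # Image is as wide as or wider than the target shape: add height.
        canvas_w = width
        canvas_h = -(-width * ratio_h // ratio_w)  # ceiling division
    return canvas_w, canvas_h, (canvas_w - width) // 2, (canvas_h - height) // 2
```

The offsets say where to paste the original picture on a blank (for example white) canvas of the returned size. If the structure is being drawn locally anyway, drawing it directly at the target size would achieve the same thing.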

alintheopen commented 3 years ago

This is cool - how might it feed in molecules from other teams too? Would people just upload a sheet of all molecules in their library? And have a 'wanted' or 'ready to share' tag/poster?

miike commented 3 years ago

> This is cool - how might it feed in molecules from other teams too? Would people just upload a sheet of all molecules in their library? And have a 'wanted' or 'ready to share' tag/poster?

I'm not too sure yet but something like this could work. Ideally they would have a Google sheet that they could copy and then fill in their own molecules, make a pull request and then they would start showing up.

I'd like to have some notion of autodiscovery (e.g., SCINDR) but I'm not too sure where we would start with that.

miike commented 3 years ago

Might be beyond the scope of this issue but can you think of a way that we might be able to open this up wider / to more data sources @mattodd ?

mattodd commented 3 years ago

Oh sure, that's totally the plan. I just wanted to stress test the bot so that we could make a "checklist" of what people would need to provide so that they can include their own list of molecules for the bot to scrape and advertise. It's intended to help any open drug discovery project that hits certain data availability criteria. Let me re-engage with how it's currently doing and try out a new spreadsheet. Maybe by Friday for the next Open Source Antibiotics meeting.

miike commented 3 years ago

Sounds good - I'll open the repo up (accidentally created it as private) and add some contributing docs if anyone would like to add a sheet in.

drc007 commented 3 years ago

@greglandrum might be able to help with the image generation. Could we use the same resource that we used for Slack? @miike @mattodd

miike commented 3 years ago

Alright - the repo is now public at https://github.com/OpenSourceMalaria/planetaryforgebot

I've also created an issue for better image generation - https://github.com/OpenSourceMalaria/planetaryforgebot/issues/2

greglandrum commented 3 years ago

I have a web service that can be used to generate either SVG or PNG from SMILES or Mol/SDF.

I’m happy to walk through how to use that if anyone’s interested

Alternatively, the RDKit JavaScript bindings could also be used to make the whole thing self contained if someone is writing an app in JS: https://github.com/MichelML/rdkit-js

drc007 commented 3 years ago

My thought would be to keep it simple so just use the web service.

greglandrum commented 3 years ago

Makes sense to me. I can provide a quick overview of how to use the service if someone is interested

miike commented 3 years ago

That'd be great @greglandrum . I have the local version working in a branch (I'm hoping that Twitter will preserve the metadata in the Cairo PNGs) but if the web service makes it easier to consume that'd work too.

drc007 commented 3 years ago

@greglandrum @miike Can we use the web service to add calculated properties as we did for Slack?

miike commented 3 years ago

@drc007 I'm not too familiar with the Slack bot but I imagine this should be possible.

greglandrum commented 3 years ago

> @greglandrum @miike Can we use the web service to add calculated properties as we did for Slack?

The calculated properties are generated by the Slack service itself, and that one isn't really suitable for use by other applications. When I have the time to revisit how I'm building lambdas I can also create one which does properties.

greglandrum commented 3 years ago

> That'd be great @greglandrum . I have the local version working in a branch (I'm hoping that Twitter will preserve the metadata in the Cairo PNGs) but if the web service makes it easier to consume that'd work too.

The OpenAPI documentation for the service is here: https://mol-renderer2-dev.t5ix.io/apidocs/ I'd suggest ignoring the canon_smiles service (which is just there for "historical reasons"), but to_3d and to_png may be useful. I plan to switch these over to using AWS Lambda at some point in the hopefully not too distant future, but there's some infrastructure work I need to do before I can do that. That would just change the URL(s) to access the services though, the API will remain the same.

miike commented 3 years ago

> > That'd be great @greglandrum . I have the local version working in a branch (I'm hoping that Twitter will preserve the metadata in the Cairo PNGs) but if the web service makes it easier to consume that'd work too.
>
> The OpenAPI documentation for the service is here: https://mol-renderer2-dev.t5ix.io/apidocs/ I'd suggest ignoring the canon_smiles service (which is just there for "historical reasons"), but to_3d and to_png may be useful. I plan to switch these over to using AWS Lambda at some point in the hopefully not too distant future, but there's some infrastructure work I need to do before I can do that. That would just change the URL(s) to access the services though, the API will remain the same.

Thanks - I'll aim to get this added in over the next couple of days. In the meantime I'm just using local rdkit which is now generating much nicer images than before: Before: https://twitter.com/PlanetaryForge/status/1408715711518855170 After: https://twitter.com/PlanetaryForge/status/1409285564839378944

@drc007 @greglandrum Are you able to share a little more around what the calculated properties are / how they are displayed? I might be able to figure this out given what's in place at the moment.

greglandrum commented 3 years ago

> Thanks - I'll aim to get this added in over the next couple of days. In the meantime I'm just using local rdkit which is now generating much nicer images than before: Before: https://twitter.com/PlanetaryForge/status/1408715711518855170 After: https://twitter.com/PlanetaryForge/status/1409285564839378944

Yeah, the "before" picture was made using the old drawing code without a local pycairo install, so it's kind of doubly bad. The new version looks much better. :-)

For what it's worth: if you need to run the RDKit locally anyway, there's nothing to be gained from using the web service - it's just going to be calling the same code that you'd be calling. The web service is there for pure web applications
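For anyone following along, a minimal sketch of what local rendering with the RDKit can look like in Python. It is not the bot's actual code: the function names and the default image size are assumptions. It uses the Cairo-backed drawer, which in recent RDKit releases also embeds the molecule in the PNG metadata, so the structure can be read back from the image file. Whether Twitter keeps that metadata after upload is not something this sketch can tell you.

```python
def render_png(smiles, size=(1200, 675)):
    """Draw a SMILES string to PNG bytes using the RDKit's Cairo-backed drawer."""
    # Imported inside the function because the RDKit is a third-party dependency.
    from rdkit import Chem
    from rdkit.Chem.Draw import rdMolDraw2D

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    drawer = rdMolDraw2D.MolDraw2DCairo(*size)
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
    drawer.FinishDrawing()
    return drawer.GetDrawingText()  # raw PNG bytes


def mol_from_png(png_bytes):
    """Recover the molecule from the metadata the RDKit stored in the PNG."""
    from rdkit import Chem

    return Chem.MolFromPNGString(png_bytes)
```

Writing the returned bytes to a file opened in binary mode gives an ordinary PNG that any image viewer can open.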

> @drc007 @greglandrum Are you able to share a little more around what the calculated properties are / how they are displayed? I might be able to figure this out given what's in place at the moment.

This is what a card from the Slack service looks like: image

miike commented 3 years ago

> This is what a card from the Slack service looks like: image

Ah thanks! @drc007 Are you thinking of the properties in the body of the tweet (in which case we're a little bit more limited in characters) or the image itself? (both are achievable in theory).

drc007 commented 3 years ago

@miike I was thinking the body of the tweet, would be very useful to have a chemically intelligent descriptor such as SMILES so folks can simply copy paste rather than needing to redraw.

I guess could include link to more details?

miike commented 3 years ago

> @miike I was thinking the body of the tweet, would be very useful to have a chemically intelligent descriptor such as SMILES so folks can simply copy paste rather than needing to redraw.
>
> I guess could include link to more details?

@drc007 - Similar to this? https://twitter.com/PlanetaryForge/status/1408715711518855170 This has the SMILES and InChIKey at the moment but we could probably add additional metadata. The image also embeds some information which means you could reload without drawing the structure. Maybe we need a download as molfile?

For the mol weight / Lipinski properties we could either add to the tweet body (easier for shorter SMILES strings) or embed in the image itself as text.

drc007 commented 3 years ago

That looks great

miike commented 3 years ago

@drc007 @greglandrum I've now added support for the same properties that the Slack service has. It should be reasonably easy to add new ones, provided they are in RDKit already.
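As an illustration of how little code a new RDKit-backed property needs, here is a hedged Python sketch. The particular set of properties (molecular weight, LogP, H-bond donors and acceptors) and the function names are assumptions based on the mol weight / Lipinski discussion above; they are not taken from the Slack service or from the bot's code.

```python
def lipinski_properties(smiles):
    """Compute a few Lipinski-style descriptors with the RDKit."""
    # Imported inside the function because the RDKit is a third-party dependency.
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Crippen.MolLogP(mol),
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
    }


def format_properties(props):
    """Format properties as one compact line, floats rounded to one decimal place."""
    return " | ".join(
        f"{name} {value:.1f}" if isinstance(value, float) else f"{name} {value}"
        for name, value in props.items()
    )
```

The formatted line is short enough to sit in the tweet body next to a short SMILES string; for long SMILES strings it could instead be drawn onto the image.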

@alintheopen @mattodd I've also added a Markdown doc on how others can add their own sheets here - https://github.com/OpenSourceMalaria/planetaryforgebot/blob/main/README.md

drc007 commented 3 years ago

This will be very useful.