wacl-york / mcm-web

Code for the MCM web application

Molecular Weights export #248

Closed stulacy closed 6 months ago

stulacy commented 6 months ago

Previously the MCM provided the opportunity to export a TSV file containing Smiles, Inchi, and Mass for every species in a sub-mechanism. This was called a "Molecular Weight" export, and an example is shown here.

This would be simple to implement, as it's just a straight dump from the DB.

A few points that I'd welcome thoughts on, particularly from @AlfredMayhew if you have time.

AlfredMayhew commented 6 months ago

I've put some thoughts on your questions below. The key output for my purposes is the SMILES string as any of the other information (e.g. molecular mass) can be determined using the SMILES string and various chemical toolkits.

Another point is that it could be good to include the 'synonyms' field from the MCM species pages in this file. This is probably not a very widely used field, but it would mean that all of the information from the MCM webpages can be downloaded in some format (the species information and the mechanism info). The 'masses' file is not large so I think including as much information from the database as possible could be good, even if the information is only useful in a small portion of cases.

stulacy commented 6 months ago

Thanks Alfie. On reflection I'll probably keep the citation in, just so the file is self-documenting and portable. I'll put column headers in too. Adding the synonyms is a good shout; this could be done as a comma-separated string in one field.

The database also contains fields for 'excited' and whether a species is a peroxy radical or not (used for building the RO2 sum). Do you have any thoughts on whether these would be useful to make available? I'll also check this with Andrew as I'm not sure if the excited field has been updated recently.

stulacy commented 6 months ago

I've created a PR to add this export option back in (#252). I just need to check a few things with Andrew before making it live. In the meantime here's an example file from exporting the Isoprene mechanism (NB: file extension is normally .tsv but had to change to .txt to be able to upload it to GitHub).

Do you see any obvious problems with this file? I've kept the citation in so when you read it you just need to skip the first 46 lines.

AlfredMayhew commented 6 months ago

I read this file using pandas in Python with the following code:

import pandas as pd
# Skip the 46-line citation header; the next line holds the column names.
d = pd.read_csv("mcm_export_species.txt", sep="\t", skiprows=46, index_col=0)

I also opened the file in Excel by specifying a tab delimiter and starting the file read at line 47.

Both of these methods seem to read the file without issues. Some values are read as NaN (or blank Excel cells). This is common with the synonyms column, as many species don't have common names. However, the most extreme cases are species like SO2 which have no information. I think it's fine to keep these species in though. Users can filter them out fairly easily if needed.
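As a rough illustration of that filtering step, here is a minimal sketch using only the Python standard library. The column names (`Name`, `Smiles`, `Inchi`, `Mass`) and the two-row sample are assumptions made for the example, not the confirmed layout of the export.

```python
import csv
import io

# Hypothetical miniature of the export (citation header already removed).
# Column names are assumed; check them against the real file.
sample = (
    "Name\tSmiles\tInchi\tMass\n"
    "IPROPOL\tCC(C)O\tInChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3\t60.1\n"
    "SO2\t\t\t\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Keep only the species that actually carry a SMILES string.
with_structure = [row["Name"] for row in rows if row["Smiles"].strip()]
print(with_structure)
```

With the pandas DataFrame `d` from the snippet above, the equivalent would be something like `d.dropna(subset=["Smiles"])`, again assuming that column name.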

I think it's very helpful to have the 'PeroxyRadical' field included. I've previously used the SMILES strings to determine whether each species is an RO2, but having the Boolean already present will help speed this up a lot! I don't do a lot of work with Criegees, which will be the only 'excited' species (I think?), so I'm not sure how helpful that field would be. However, it does seem like it could be useful if someone was trying to distinguish two identical species where the only difference is excitation. E.g. the only difference between ACLOO and ACLOOA is that one is the excited form of the other. This could be figured out from the mechanism, but it could be helpful to have that information easily accessible. Again, it's fairly easy to ignore any excess information that isn't of interest.
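For anyone wanting to build an RO2 list from that flag, a minimal sketch might look like the following. It assumes the column is called `PeroxyRadical` and that truthy values are written as `True`, `true`, `t` or `1`; the actual encoding in the export should be checked first, and the two sample rows are illustrative only.

```python
import csv
import io

# Hypothetical two-row sample; the real export has more columns.
sample = (
    "Name\tPeroxyRadical\n"
    "IPROPOL\tFalse\n"
    "CH3O2\tTrue\n"
)

# Assumed spellings of a truthy flag; adjust to match the real file.
TRUTHY = {"true", "t", "1"}

rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
ro2_species = [
    row["Name"] for row in rows
    if row["PeroxyRadical"].strip().lower() in TRUTHY
]
print(ro2_species)
```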

Once you add this export option to the website (and when I have some time), I will update my MCM_Search_Tools repo that you linked to at the beginning to use this updated file.

RolfSander commented 6 months ago

Hello @stulacy,

It's great to see that the export of SMILES, InChI and molar masses will be re-implemented. This is very helpful! It's also good to see that the InChIKey has been added.

I have two suggestions:

1) Although it's good to include the synonyms, I think it is dangerous to use a comma-separated string here. Many chemical names include commas, some even a comma followed by a space. For example, one of the synonyms of C23O3CCO2H is "Propanoic acid, 2-oxo-, carboxymethyl ester". Semicolons instead of commas should be a safe alternative to separate the individual names inside the synonyms field.

2) For convenience, would it be possible to add a separate column for the elemental composition of the compounds? You can always find it between the first and second / in the InChI string.

The reason for suggesting the second point is that I need the elemental composition in my KPP files. I would like to replace the IGNORE which is currently listed in the MCM-generated KPP files. The availability of the elemental composition would enable a much more powerful analysis of the selected mechanism, e.g., mass balance checks.
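To make the first suggestion concrete, here is a small sketch of why a comma separator is ambiguous for the synonyms field while a semicolon is not. The second entry, `placeholder synonym`, is a made-up stand-in used only to give the field two items.

```python
# Real synonym quoted above; it contains commas itself.
name = "Propanoic acid, 2-oxo-, carboxymethyl ester"

# Splitting on commas breaks a single name into three fragments.
by_comma = [part.strip() for part in name.split(",")]

# A semicolon-separated field keeps each name intact.
# "placeholder synonym" is a hypothetical second entry.
field = ";".join([name, "placeholder synonym"])
by_semicolon = [part.strip() for part in field.split(";")]

print(by_comma)
print(by_semicolon)
```

This only holds as long as no synonym contains a semicolon, which would need to be verified against the database.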

stulacy commented 6 months ago

Hi @RolfSander , your suggestions are sensible and useful as always!

The elemental composition seems reasonable to me, although I'll need to run it by my colleagues here. Just for clarity, for IPROPOL with InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3, this would be C3H8O.
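The extraction described here can be sketched in a few lines of Python. The helper name `formula_from_inchi` is hypothetical, and the sketch simply takes whatever sits between the first and second `/`; it does not check that the layer really is a formula, so unusual InChI strings without a formula layer would need extra handling.

```python
def formula_from_inchi(inchi: str) -> str:
    """Return the text between the first and second '/' of an InChI string."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"no formula layer found in {inchi!r}")
    return parts[1]


ipropol_formula = formula_from_inchi("InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3")
print(ipropol_formula)
```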

RolfSander commented 6 months ago

Yes, it's C3H8O for IPROPOL.

Thinking about it, "molecular formula" or just "formula" is probably a better name for the column header than "elemental composition".

RolfSander commented 6 months ago

Since we now have the formula available from the database, I wonder if it would be possible to replace the IGNORE in the MCM-generated KPP file by the real composition. For example, the formula of IPROPOL is C3H8O. The corresponding line in the #DEFVAR section of the KPP file should be:

IPROPOL = 3C + 8H + O ;

The format conversion could be done with a regexp like this:

formula_for_kpp = formula.gsub(/([A-Z][a-z]?)(\d+)?/, '\2\1 + ')

and then deleting the final + sign.
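For readers working in Python rather than Ruby, roughly the same conversion could be sketched as below. This is an illustrative port, not the code used by the MCM web application; the function name `formula_to_kpp` is hypothetical, and joining the terms with ` + ` avoids having to strip a trailing plus sign afterwards.

```python
import re


def formula_to_kpp(formula: str) -> str:
    """Turn a formula such as 'C3H8O' into KPP atom-count syntax '3C + 8H + O'."""
    # Each match is (element symbol, optional count); a missing count means 1
    # and is left blank, matching the KPP line quoted above.
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return " + ".join(f"{count}{element}" for element, count in pairs)


kpp_line = f"IPROPOL = {formula_to_kpp('C3H8O')} ;"
print(kpp_line)
```

The sketch assumes a plain formula made of element symbols and counts; charges or multi-component formulas are not handled.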

stulacy commented 6 months ago

Sure, that definitely sounds possible, Rolf. Would you mind opening a new issue for it, just so this issue can stay focused on the species list export? Thanks!

RolfSander commented 6 months ago

Yes, let's keep that separate. I've opened a new issue here: https://github.com/wacl-york/mcm-web/issues/257

stulacy commented 6 months ago

This feature has gone live in #252, so I've closed the issue.