phetsims / build-a-molecule

"Build a Molecule" is an educational simulation in HTML5, by PhET Interactive Simulations.
GNU General Public License v3.0

New data base checks #219

Open issamali67 opened 3 years ago

issamali67 commented 3 years ago

The new database is finished. It contains a total of 23591 molecules (a few are single-atom molecules, which can be filtered out). We need to randomly check some of the new molecules (see attached other-molecules.txt). other-molecules.txt

issamali67 commented 3 years ago

Here is the current (legacy) BaM database. legacy_other_molecules.txt

ariel-phet commented 3 years ago

@jonathanolson @issamali67 @arouinfar There seem to be a few issues with the new molecule list. Overall I think the list of new molecules is correct, but the new list has filtered out some legitimate compounds from the original list.

This Excel file should provide an easy reference: BAM new molecules.xlsx

Here are some examples, going down the list of about 1000 molecules that do not seem to be in the new list (by PubChem ID):

  1. 6347 - appears to be legitimate
  2. 7235 - appears to be a legacy record, and should not be included
  3. 10038 - seems legitimate and should be included
  4. 11190 - not found under ID or name, should not be included
  5. 11535 - seems legitimate

My theory is that, for some of these legitimate ones like 11535, something like the following is occurring: searching 11535 brings up the legitimate record for the "Compound CID" but brings up a molecule we would not include for the "Substance SID".

So some of the filtering appears to be correct (rejecting legacy records), and some seems to be incorrect (rejecting legitimate molecules that should be included in the updated list).
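The comparison above (legacy molecules missing from the new list, matched by PubChem ID) can be sketched as a set difference. This is only an illustration: the function name `missing_cids` and the toy input values are assumptions, not part of the actual scripts, and CID 1 is a placeholder.

```python
def missing_cids(legacy_cids, new_cids):
    """Return the CIDs present in the legacy list but absent from the new list."""
    return sorted(set(legacy_cids) - set(new_cids))


# Toy input: CID 1 is a placeholder; 6347 and 7235 are IDs mentioned above.
legacy = [1, 6347, 7235]
new = [1]
print(missing_cids(legacy, new))  # prints [6347, 7235]
```

Each ID returned this way still needs a manual check, since some are correctly rejected legacy records.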

Maybe @issamali67 can troubleshoot his scripts, or maybe it would be useful for QA to go through the first 100 or so of these as above and see if any other patterns emerge.

arouinfar commented 3 years ago

Thanks for creating the spreadsheet @ariel-phet! I spent ~15 minutes checking the first 50 rows in the "Molecules not in updated list from original list" section; see BAM.new.molecules-AR.xlsx. About 60% of the compounds appear to have been incorrectly eliminated. The remaining 40% were eliminated because the compound was a radical or the ID was a match for an irrelevant SID/PMID.

I found two borderline cases where the IDs matched CIDs of legitimate compounds. However, the name listed in the spreadsheet was not a synonym on the PubChem profile. I checked the spreadsheet for other instances of the chemical formula, but didn't find anything.

I think a good first step would be for @issamali67 to troubleshoot.

issamali67 commented 3 years ago

I think you guys checked these molecules using the PubChem search, and the results there can differ from what is in the SDF files, which are used for filtering. I checked the SDF files for the first 100 molecules appearing in the Excel file (including the ones @arouinfar looked at). These 100 divide into 3 categories:

(1) name not in SDF: a molecule whose information mostly exists in the SDF file but whose name does not. 52 molecules in the Excel file are like that. Since the current database includes molecular names, my program filters out molecules that have no name in the SDF file.

(2) not in SDF: a molecule that does not exist in the SDF file. I found 25 such molecules. But most probably you will find them when you do a PubChem search!

(3) filtered out: these are filtered out by my program, and I will investigate why. I found 23 molecules in this category.
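The three-way split above can be sketched as follows. This is a minimal illustration, not the actual program: it assumes the SDF data has already been parsed into a dict keyed by CID, and the names `classify`, `sdf_records`, and `new_cids`, as well as the toy CIDs, are hypothetical.

```python
from collections import Counter


def classify(cid, sdf_records, new_cids):
    """Assign a missing CID to one of the categories described above."""
    record = sdf_records.get(cid)
    if record is None:
        return "not in SDF"
    if not record.get("name"):
        return "name not in SDF"
    if cid not in new_cids:
        return "filtered out"
    return "kept"


# Toy data (hypothetical CIDs and names, for illustration only).
sdf_records = {
    101: {"name": ""},            # record exists, but has no name
    102: {"name": "example-a"},   # named, but rejected by a later filter
    103: {"name": "example-b"},   # named and kept
}
new_cids = {103}
counts = Counter(classify(cid, sdf_records, new_cids) for cid in [100, 101, 102, 103])
print(counts)
```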

Will update later on category 3.

issamali67 commented 3 years ago

Found the problem. Among the molecules that are filtered out (highlighted in yellow in the attached Excel file):

(1) There are no buckets in the current BaM sim for 3 molecules. For example, the sim does not allow molecules of sulfur and oxygen, or of boron and fluorine (see the column with header "comments 2" in the attached Excel file). Maybe this was allowed in an older BaM version?

(2) The remaining filtered-out molecules: I do 2 stages of filtering. The result of the first stage is a "first-pass" list. All the molecules highlighted in yellow (minus the ones above) are present in the first-pass list but are filtered out in the second stage. This is because, if there are duplicates of a molecule, my code takes the one with the smallest CID (I order molecules by CID), and the smallest CID may be, for example, an isotope, etc.
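The duplicate-handling problem in (2) can be illustrated with a small sketch. The thread does not say how the code will be corrected, so this shows only one possible approach, under assumptions: each record is a dict with hypothetical fields `key` (whatever identifies duplicates), `cid`, and `is_isotope`, and the fix is to prefer the smallest CID among non-isotope duplicates.

```python
from collections import defaultdict


def pick_representatives(records):
    """For each group of duplicates, keep one record.

    Prefers the smallest CID among non-isotope records, and falls back to
    the smallest CID overall only when every duplicate is an isotope.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["key"]].append(rec)

    chosen = []
    for group in groups.values():
        plain = [r for r in group if not r["is_isotope"]]
        pool = plain or group  # fall back if no plain record exists
        chosen.append(min(pool, key=lambda r: r["cid"]))
    return sorted(chosen, key=lambda r: r["cid"])


# Toy data: in group "A" the smallest CID is an isotope, so CID 20 is kept.
records = [
    {"key": "A", "cid": 10, "is_isotope": True},
    {"key": "A", "cid": 20, "is_isotope": False},
    {"key": "B", "cid": 5, "is_isotope": True},
]
print([r["cid"] for r in pick_representatives(records)])  # prints [5, 20]
```

Taking the smallest CID without the isotope check would return CID 10 for group "A", which is the behavior described in (2).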

I will correct this and regenerate the database. BAM.new.molecules-AR-IH.xlsx

issamali67 commented 3 years ago

I fixed the problem with the filtering code. I still get around 23K molecules in the end. Many of these molecules were created after the first database came out in 2011. Attached is the Excel file (containing names, formulas, and CIDs only) for checking (I already checked some random ones, and they look fine). Most of these molecules have 3D info (I am currently downloading it). A few are 2D only. new_filtered_2d.txt