Open Denz1994 opened 4 years ago
Additional Details:
The approach to importing the data set in the legacy version required a parser, some filters, and a post-processor.
The parser (MoleculeSDFCombinedParser) is responsible for importing all the data from PubChem using an SDF file. This will generate two text files of molecule data (collection-molecules.txt and other-molecules.txt). The file collection-molecules.txt contains molecule data for the collection boxes, while other-molecules.txt holds data for other molecules that can be built in the sim. See https://github.com/phetsims/build-a-molecule/issues/153#issuecomment-580072079 for details on how to read these entries.
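For anyone unfamiliar with SDF input, here is a minimal sketch of reading one in JavaScript. It is not the legacy MoleculeSDFCombinedParser, and the function names parseSdfRecords and parseDataItems are hypothetical. It only relies on the general SDF layout: each record ends with a line containing `$$$$`, and data items appear as a `> <FIELD_NAME>` header followed by value lines and a blank line. The structure (molfile) block at the top of each record is ignored here.

```javascript
// Sketch only: split SDF text into records and collect each record's data items.
// Function names are hypothetical and do not mirror the legacy Java tools.
function parseSdfRecords( sdfText ) {
  return sdfText
    .split( /^\$\$\$\$[ \t]*\r?$/m ) // a "$$$$" line terminates each record
    .filter( record => record.trim().length > 0 )
    .map( parseDataItems );
}

// Returns an object mapping data-item names (e.g. PUBCHEM_COMPOUND_CID) to their text values.
function parseDataItems( record ) {
  const fields = {};
  const lines = record.split( /\r?\n/ );
  for ( let i = 0; i < lines.length; i++ ) {
    const header = /^>.*<([^>]+)>/.exec( lines[ i ] );
    if ( header ) {
      const values = [];
      // Value lines run until the next blank line (or the end of the record).
      while ( i + 1 < lines.length && lines[ i + 1 ].trim() !== '' ) {
        values.push( lines[ ++i ] );
      }
      fields[ header[ 1 ] ] = values.join( '\n' );
    }
  }
  return fields;
}
```

Each returned object could then be mapped to whatever line format the collection and other-molecule files need; that mapping is intentionally left out because it depends on the entry format described in the linked comment.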
At this point, we will need to filter out molecules that we don't want to build (for either pedagogical or memory reasons). MoleculeKitFilterer and MoleculeDuplicateNameFilter handle this for us.
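To make the filtering step concrete, here is a rough JavaScript sketch of the two kinds of filter. It is an assumption about what such filters could look like, not a port of MoleculeKitFilterer or MoleculeDuplicateNameFilter. The entry shape (`{ name }`), the function names, and the rules (first occurrence of a name wins; a molecule "fits" if it only uses allowed elements and stays under an atom limit) are all hypothetical. The formula parser only handles simple formulas such as C2H6O, with no charges or parentheses.

```javascript
// Sketch only: hypothetical filters, not the legacy Java implementations.

// Keep the first molecule seen for each name (case-insensitive, trimmed).
function filterDuplicateNames( molecules ) {
  const seen = new Set();
  return molecules.filter( molecule => {
    const key = molecule.name.trim().toLowerCase();
    if ( seen.has( key ) ) {
      return false;
    }
    seen.add( key );
    return true;
  } );
}

// Count atoms per element in a simple molecular formula, e.g. "C2H6O".
function countAtoms( formula ) {
  const counts = {};
  for ( const [ , symbol, digits ] of formula.matchAll( /([A-Z][a-z]?)(\d*)/g ) ) {
    counts[ symbol ] = ( counts[ symbol ] || 0 ) + ( digits ? Number( digits ) : 1 );
  }
  return counts;
}

// True if the formula uses only allowed elements and at most maxAtoms atoms in total.
function fitsKit( formula, allowedElements, maxAtoms ) {
  const counts = countAtoms( formula );
  const total = Object.values( counts ).reduce( ( sum, n ) => sum + n, 0 );
  return total <= maxAtoms && Object.keys( counts ).every( symbol => allowedElements.has( symbol ) );
}
```

For example, `fitsKit( 'H2O', new Set( [ 'H', 'O' ] ), 5 )` accepts water, while a formula containing any element outside the set is rejected. The real criteria should come from the design team and the legacy docs.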
The last step involves MoleculePreprocessing, which will generate the structural format for our molecules in a serialized format. See Structure.txt.
Action Items:
[ ] Familiarize yourself with the intended formatting and expected input/output for each component mentioned above. Docs are provided in molecule-data-readme.txt
[ ] Determine if the steps provided in molecule-data-readme.txt are still accurate. Will these steps generate a usable data set for the ported sim?
[ ] If the legacy steps for data generation don't work as intended, then investigate the PubChem website for a modernized approach for importing the data set. This may require a new set of filters or a new post-processor. A good place to start would be here and, more generally, the PubChem site.
[ ] Confirm with the design team that the filters are filtering out the correct data. There may be additional molecule classifications we don't want to feature. They should be identified and filtered as needed.
[ ] Work on porting the parser, filters, and post-processor tools into HTML5 code with support for ES6 modules.
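One possible shape for the port, sketched under the assumption that each tool becomes a plain function from data to data: the parser, filters, and post-processor can then be chained in order and tested independently. The name runPipeline and the stand-in stages below are hypothetical; in the actual port each stage would presumably live in its own ES6 module and be imported where the pipeline is assembled.

```javascript
// Sketch only: run an ordered list of stage functions, feeding each output to the next stage.
function runPipeline( input, stages ) {
  return stages.reduce( ( data, stage ) => stage( data ), input );
}

// Usage with trivial stand-in stages (parse -> filter -> post-process).
const parseStage = text => text.split( '\n' ).filter( line => line.length > 0 );
const filterStage = names => names.filter( name => name !== 'unwanted' );
const serializeStage = names => names.join( '|' );

const result = runPipeline( 'water\nunwanted\nammonia\n', [ parseStage, filterStage, serializeStage ] );
```

Keeping the stages as separate functions would also make it straightforward to add or swap a filter once the design team confirms which molecules should be excluded.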
Here is a zip file of the BAM legacy source code with the relevant content described above: build-a-molecule-java.zip
This sim requires that all possible molecules and molecule structures are defined prior to being built. This data is stored in js/data and was derived from PubChem. Taking a look at js/data/ we see the current data set is comprised of:
collectionMoleculesData.js: Shortlist of PubChem molecules used for collection boxes.
otherMoleculesData.js: Responsible for all PubChem related data with entries that can be read as described in https://github.com/phetsims/build-a-molecule/issues/153#issuecomment-580072079.
structuresData.js: Responsible for all possible structures. These structures may or may not have a correlated structure in collectionMoleculesData.js.
The tools used to generate this data set have yet to be completely ported from Java and would require additional documentation. This includes the filtering of any molecules not desired for this sim. During the design meeting on 01/31/20, it was decided to postpone this work until after publication of this sim.
Assigning to @ariel-phet for prioritization and assignment.