This PR adds `validate_and_assign_ids.py`, as well as a stub for `cli.py` (which should probably be overwritten by the one that @dotsdl is writing). It also adds a ton of tests in `test_validate_and_assign_ids.py` and some test utilities in `tests/utils.py`.
The major work here is in the `validate_and_assign` function.
It takes as input:
- `input_graph_files`: A list of SDF files containing 3D molecules. The coordinates are stripped and only the molecular graph is kept; new conformers for these molecules will be generated in subsequent steps. The same chemical species MUST NOT appear more than once across these files, and an error is raised if it does.
- `input_3d_files`: A list of SDF files containing 3D molecules. The same chemical species MAY appear multiple times across these files, and each conformer provided this way replaces one that would otherwise be generated in a subsequent step.
- `output_directory`: Optional, default `1-validate_and_assign`. The directory that will hold the output of this workflow step. If it already exists, an exception is raised, prompting the user to manually delete the existing directory if they really intend to run the step again.
- `group_name`: Nominally the three-character code for this dataset, although any string is currently accepted.

Note: For now, it's permissible for the same molecule to appear in both `input_graph_files` and `input_3d_files`.
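
The two failure modes described above (a repeated species in `input_graph_files`, and an output directory that already exists) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `check_unique_species` and `make_output_directory` are hypothetical helper names, and molecules are represented by pre-computed identifier strings (for example canonical SMILES) rather than being loaded from SDF.

```python
from collections import Counter
from pathlib import Path


def check_unique_species(species_keys):
    """Raise if any species identifier appears more than once.

    `species_keys` is assumed to hold one canonical identifier per input
    graph molecule (a hypothetical stand-in for the loaded SDF molecules).
    """
    counts = Counter(species_keys)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Species appear more than once in graph inputs: {duplicates}"
        )


def make_output_directory(output_directory="1-validate_and_assign"):
    """Create the output directory, refusing to reuse an existing one."""
    path = Path(output_directory)
    try:
        # exist_ok=False makes mkdir raise if the directory is already there.
        path.mkdir(parents=True, exist_ok=False)
    except FileExistsError as err:
        raise FileExistsError(
            f"{path} already exists; delete it manually to rerun this step."
        ) from err
    return path
```

Raising instead of silently overwriting keeps the outputs of an earlier run from being mixed with those of a new one.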
This produces as output:
- 3D SDF files following the pattern `<group_name>-<5-digit molecule ID>-<2-digit conformer ID>.sdf`, for example `JRW-00004-00.sdf`.
  - These files will have their atoms indexed identically.
  - The numerical components of the file name are indexed beginning at zero.
  - The files contain SD data pairs for the information in the file name: `group_name`, `group_id`, and `conformer_index`.
- Mapped, isomeric, explicit-hydrogen SMILES, following the pattern `<group_name>-<5-digit molecule ID>.smi`, for example `JRW-00004.smi`.
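
As a rough illustration of the naming scheme above, the sketch below assigns zero-based molecule and conformer indices and formats the resulting file names. `output_names` and `assign_ids` are hypothetical names, and each input is again represented by a species identifier string. Handing out molecule IDs in order of first appearance is an assumption of this sketch, not something the PR states.

```python
def output_names(group_name, molecule_id, conformer_id):
    """Return the (SDF, SMILES) file names for one conformer."""
    stem = f"{group_name}-{molecule_id:05d}"
    return f"{stem}-{conformer_id:02d}.sdf", f"{stem}.smi"


def assign_ids(species_keys):
    """Map each input to a zero-based (molecule_id, conformer_id) pair.

    Repeated species keep their molecule ID and get the next conformer ID.
    Assumption: molecule IDs are handed out in order of first appearance.
    """
    molecule_ids = {}      # species key -> molecule ID
    conformer_counts = {}  # species key -> conformers seen so far
    assigned = []
    for key in species_keys:
        molecule_id = molecule_ids.setdefault(key, len(molecule_ids))
        conformer_id = conformer_counts.get(key, 0)
        conformer_counts[key] = conformer_id + 1
        assigned.append((molecule_id, conformer_id))
    return assigned
```

For example, `output_names("JRW", 4, 0)` returns `("JRW-00004-00.sdf", "JRW-00004.smi")`, matching the examples above.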
Questions
- What should happen if a user supplies more than 10 conformers? There is currently no logic to handle this.
Status
- [x] Fill out test input molecules
- [ ] ~Test enumerating stereoisomers(?)~ This doesn't enumerate stereoisomers
- [x] Implement tests
- [x] Test loading all inputs in `data/molecules`
- [x] good input with a single molecule
- [x] good input with multiple molecules
- [x] ~bad~ good input with repeated molecules, which get assigned different conformer IDs
- [x] bad 2D molecule (just don't accept 2D SDF at all for now) ~with defined stereo~
Description
Initial implementation, closes #8