Closed: balhoff closed this issue 2 years ago
I have a branch that simplifies the dosdp-tools rules for the anatomical-entity* files into one pattern rule:
https://github.com/phenoscape/pipeline/compare/simplify-dosdp-tools-rules
Part of this change renames some intermediate files to match the input files.
For example, anatomical-entity-presences.ofn is renamed to anatomical-entity-implies_presence_of.ofn.
This is to match the input file patterns/implies_presence_of.yaml.
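As a rough sketch of what a single pattern rule along these lines could look like (the branch linked above is the authoritative version): the stem `%` ties the output file name to the pattern YAML name, which is why the rename is needed. The input table name `anatomical-entities.tsv` is an assumption made for illustration, not taken from the pipeline.

```make
# Sketch only: one pattern rule in place of the per-pattern anatomical-entity-*.ofn rules.
# The stem (%) is the pattern name, e.g. implies_presence_of.
# ASSUMPTION: $(BUILD_DIR)/anatomical-entities.tsv is a hypothetical input table.
$(BUILD_DIR)/anatomical-entity-%.ofn: patterns/%.yaml $(BUILD_DIR)/anatomical-entities.tsv
	dosdp-tools generate \
		--obo-prefixes=true \
		--template=$< \
		--infile=$(word 2,$^) \
		--outfile=$@
```

With this shape, requesting `$(BUILD_DIR)/anatomical-entity-implies_presence_of.ofn` makes Make look for `patterns/implies_presence_of.yaml` automatically.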
I still need to run the pipeline on these changes, so I haven't created a PR yet.
Another area of duplication in the Makefile is some taxa and gene files.
We run the following commands, only replacing "taxa" with "genes":
https://github.com/phenoscape/pipeline/blob/b7a129f1587785bf7f993a2980e3a1f62b2afe62/Makefile#L640-L642
https://github.com/phenoscape/pipeline/blob/b7a129f1587785bf7f993a2980e3a1f62b2afe62/Makefile#L644-L646
Rename the "gene-" files to "genes-" (e.g., rename gene-pairwise-sim.ttl to genes-pairwise-sim.ttl).
Then create a pattern rule for each step that is used for both the "taxa" and "genes" files.
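A minimal sketch of how one such shared rule might look once the files are named consistently. It uses a static pattern rule so that only the taxa and genes targets are matched. `PAIRWISE_SIM_CMD` and the `-profiles.ttl` input name are hypothetical placeholders; the real commands are on the Makefile lines linked above.

```make
# Sketch only: one rule serves both corpora once "gene-" is renamed to "genes-".
CORPORA := taxa genes
PAIRWISE_SIM_FILES := $(patsubst %,$(BUILD_DIR)/%-pairwise-sim.ttl,$(CORPORA))

# ASSUMPTION: PAIRWISE_SIM_CMD stands in for the real command, and
# $(BUILD_DIR)/%-profiles.ttl is a hypothetical input file name.
# $* expands to "taxa" or "genes".
$(PAIRWISE_SIM_FILES): $(BUILD_DIR)/%-pairwise-sim.ttl: $(BUILD_DIR)/%-profiles.ttl
	$(PAIRWISE_SIM_CMD) $* $< $@
```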
Thoughts on this potential change, @balhoff?
I agree with all this. One gotcha is the grep vs. grep -v in the rank-statistics targets. We should split that out to generate taxa-profile-sizes.txt and genes-profile-sizes.txt in new targets from profile-sizes.txt.
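The split could be sketched roughly as follows. `TAXON_PATTERN` and its value are assumptions for illustration; the actual expression is whatever the existing rank-statistics targets pass to grep.

```make
# ASSUMPTION: TAXON_PATTERN is a hypothetical name and value for the
# expression the current targets use to pick out taxon rows.
TAXON_PATTERN := VTO_

# Lines matching the pattern are taxa.
$(BUILD_DIR)/taxa-profile-sizes.txt: $(BUILD_DIR)/profile-sizes.txt
	grep '$(TAXON_PATTERN)' $< > $@

# Everything else is treated as genes.
$(BUILD_DIR)/genes-profile-sizes.txt: $(BUILD_DIR)/profile-sizes.txt
	grep -v '$(TAXON_PATTERN)' $< > $@
```

With the grep difference isolated in these two targets, the rank-statistics step itself can depend on `$(BUILD_DIR)/%-profile-sizes.txt` and become a pattern rule like the others. One caveat: grep exits with status 1 when no lines match, which Make treats as a failed recipe.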
I have three final changes I would like to make for this issue.
Right now we have an embedded list of multiple find commands: https://github.com/phenoscape/pipeline/blob/cc4263e96a092418669bf3f99669e01f60873b86/Makefile#L131-L141
To parameterize this I suggest adding a new nexml-subdirs.txt file that would have the following contents:
curation-files/completed-phenex-files
curation-files/fin_limb-incomplete-files
curation-files/Jackson_Dissertation_Files
curation-files/teleost-incomplete-files/Miniature_Monographs
curation-files/teleost-incomplete-files/Miniatures_Matrix_Files
curation-files/teleost-incomplete-files/Dillman_Supermatrix_Files
curation-files/matrix-vs-monograph
The Makefile would then read this input file and create a find command for each of the directories.
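One possible way to do this, as a sketch: the variable names are hypothetical, and the find options (`-type f -name '*.xml'`) are an assumption, since the linked find commands may use different options per directory.

```make
# Read one directory per line from nexml-subdirs.txt.
# Note: this splits on whitespace, so directory names must not contain spaces.
NEXML_SUBDIRS := $(shell cat nexml-subdirs.txt)

# ASSUMPTION: every directory is searched with the same find options.
# Runs one find per listed directory and collects the results.
NEXML_FILES := $(foreach d,$(NEXML_SUBDIRS),$(shell find $(d) -type f -name '*.xml'))
```

Adding or removing a curation directory then only requires editing nexml-subdirs.txt, not the Makefile.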
There are 4 curl commands that download monarch data essentially the same way: https://github.com/phenoscape/pipeline/blob/cc4263e96a092418669bf3f99669e01f60873b86/Makefile#L377-L405
To simplify this, I want to make a pattern rule for $(BUILD_DIR)/monarch/%.ttl. This would require storing these output files in a subdirectory.
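A hedged sketch of such a rule, assuming all of the files share a common base URL and are named after the target stem. `MONARCH_BASE_URL` and its value are placeholders, not the real download location.

```make
# ASSUMPTION: placeholder base URL; substitute the location used by the
# existing curl commands.
MONARCH_BASE_URL := https://example.org/monarch

# $* is the file stem; $(dir $@) is the monarch/ subdirectory.
$(BUILD_DIR)/monarch/%.ttl:
	mkdir -p $(dir $@)
	curl --fail -L $(MONARCH_BASE_URL)/$*.ttl -o $@
```

Keeping the files under `monarch/` is what makes the pattern safe: without the subdirectory, `$(BUILD_DIR)/%.ttl` would also match every other .ttl target in the build.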
The following rule lists the input files twice: once as prerequisites, then again as input arguments:
https://github.com/phenoscape/pipeline/blob/cc4263e96a092418669bf3f99669e01f60873b86/Makefile#L208-L230
I think the above rule can be simplified by using the all-prerequisites automatic variable, $^.
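To illustrate the idea with a toy rule (the file names here are made up and unrelated to the pipeline): the prerequisites are written once, and `$^` expands to all of them in the recipe.

```make
# Toy example: the inputs appear only in the prerequisite list.
# $^ expands to "part-a.txt part-b.txt part-c.txt"; $@ is the target.
combined.txt: part-a.txt part-b.txt part-c.txt
	cat $^ > $@
```

Note that `$^` drops duplicate prerequisites and includes every prerequisite, so any prerequisite that is not meant to be an input argument would need to be filtered out, for example with `$(filter-out ...)`.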
It looks like we have a rule to build the monarch file hpoa.ttl, but it's commented out everywhere?
https://github.com/phenoscape/pipeline/blob/cc4263e96a092418669bf3f99669e01f60873b86/Makefile#L362-L363
https://github.com/phenoscape/pipeline/blob/cc4263e96a092418669bf3f99669e01f60873b86/Makefile#L374-L375
Here is a Make example for reducing duplication with robot merge: https://github.com/balhoff/ultimate-ontology-makefile/blob/4b7a6c913d7a1a4feb71866c9e86f41260ecc365/Makefile#L39
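For readers who do not follow the link, the general technique can be sketched like this. This is not a copy of the linked line; the target and input file names are hypothetical.

```make
# ASSUMPTION: hypothetical target and input names, for illustration only.
# $(addprefix --input ,$^) turns "a.owl b.owl" into "--input a.owl --input b.owl",
# so each ontology is listed once, as a prerequisite.
$(BUILD_DIR)/merged.owl: $(BUILD_DIR)/a.owl $(BUILD_DIR)/b.owl
	robot merge $(addprefix --input ,$^) --output $@
```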
A place to start might be the DOS-DP patterns.