ncbo / BioPortal-to-KGX

Assemble a BioPortal Knowledge Graph
BSD 3-Clause "New" or "Revised" License

Some transforms fail during translation to TSV, but write placeholder #47

Closed: caufieldjh closed this issue 2 years ago

caufieldjh commented 2 years ago

In a run of the following

python run.py --input [path to data] --write_curies --remap_types --get_bioportal_metadata --ncbo_key [key]

some transforms appear to translate just fine until the final step, at which point they are not written to TSV. Here's an example with BCO:

Starting on /home/harry/Bioportal/4store-export-2022-02-02/data/46/1e/11adde7246a87e69de4d22d0b112
ROBOT report(s) present: robot.report
KGX validation log present: kgx_validate_BCO_11.log
BioPortal metadata not found for BCO_11 - will retrieve.
Accessing https://data.bioontology.org/ontologies/BCO/...
<Response [200]>
Accessing https://data.bioontology.org/ontologies/BCO/latest_submission...
<Response [200]>
Retrieved metadata for BCO (Biological Collections Ontology)
File for BCO_11 is empty! Writing placeholder.

In this case, the only contents of the output directory for BCO are:

BCO_11
BCO_11_relaxed.json
kgx_validate_BCO_11.log
robot.measure
robot.report

Edge and node files are not present.

I've also seen this happen with DISDRIV and COGAT and a few others. Should try a fresh (completely from scratch) transform on these.

caufieldjh commented 2 years ago

This happens due to the slightly strange way the 4store dump is set up - some files are genuinely empty, aside from a header:

$ more /home/harry/Bioportal/4store-export-2022-02-02/data/46/1e/11adde7246a87e69de4d22d0b112
## GRAPH https://data.bioontology.org/ontologies/BCO/submissions/11
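
A header-only file like this one can be spotted before any transform is attempted. The following is a minimal sketch, not what BioPortal-to-KGX actually does: `is_header_only` is a hypothetical helper, and it assumes that every line beginning with `##` is a comment or header and that anything else counts as data.

```python
def is_header_only(path):
    """Return True if the file holds nothing but '##' lines and blank lines.

    Hypothetical helper: assumes '##' marks header/comment lines in the
    4store export, as in the '## GRAPH ...' line shown above.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            # Any non-blank line that is not a '##' line is treated as data.
            if stripped and not stripped.startswith("##"):
                return False
    return True
```

Checking this up front would allow such files to be skipped or logged separately instead of producing a placeholder.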

The 4store dump log indicates that the true export location for BCO is data/9b/19/d4896d9b5cc63cc7910d4ea34141, which looks like:

$ more /home/harry/Bioportal/4store-export-2022-02-02/data/9b/19/d4896d9b5cc63cc7910d4ea34141
## GRAPH http://data.bioontology.org/ontologies/BCO/submissions/11
<http://rs.tdwg.org/dwc/terms/associatedTaxa> <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> "http://rs.tdwg.org/dwc/terms/"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://rs.tdwg.org/dwc/terms/associatedTaxa> <dcterms:description> "This term can be used to provide a list of associations to Taxa other than the one defined in the Occurrence. Note that the ResourceRelationship class is an alternative means of representing associations, and with more detail. This term is not apt for establishing relationships between Taxa, only between specific Occurrences of an Organism with other Taxa. Recommended best practice is to separate the values in a list with space vertical bar space ( | )."^^<http://www.w3.org/2001/XMLSchema#string> .
...

etc.

So the loader often finds two sets of dumps for the same ontology, and one of them is empty. Not a true issue, just a bit strange.
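
One way to deal with the duplicate dumps would be to group files by their `## GRAPH` header and keep only a populated file per graph. The sketch below is an illustration under assumptions rather than the loader's real logic: the function names are hypothetical, the header is assumed to be the first line of each file, and the URI scheme is dropped when grouping because the two BCO files above differ in `http` vs `https`.

```python
import re
from collections import defaultdict

# Matches the first-line header, e.g. "## GRAPH http://data.bioontology.org/..."
GRAPH_HEADER = re.compile(r"^## GRAPH https?://(\S+)")


def graph_key(path):
    """Return the graph URI without its scheme, or None if there is no header."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        match = GRAPH_HEADER.match(handle.readline())
    return match.group(1) if match else None


def has_triples(path):
    """Return True if any non-blank line does not start with '##'."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return any(
            line.strip() and not line.startswith("##") for line in handle
        )


def pick_populated_dumps(paths):
    """Map each graph key to the first dump file that actually has data."""
    by_graph = defaultdict(list)
    for path in paths:
        key = graph_key(path)
        if key is not None:
            by_graph[key].append(path)

    chosen = {}
    for key, candidates in by_graph.items():
        populated = [p for p in candidates if has_triples(p)]
        if populated:
            chosen[key] = populated[0]
    return chosen
```

With the two BCO files above, this would return only the `data/9b/19/...` path for the BCO submission 11 graph, so no placeholder would be needed for the empty copy.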