microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Error running makefile #1254

Open Shalsh23 opened 10 months ago

Shalsh23 commented 10 months ago

Context: I am trying to run project.makefile to test the new migration script by validating a datafile after running schema migration.

Environment: Python version 3.11.4

Steps followed:

  1. cd to root dir of nmdc-schema repo
  2. poetry update
  3. poetry install
  4. make squeaky-clean
  5. make make-rdf. The output of this command throws an error as follows:
    
    rm -rf \
        OmicsProcessing.rq \
        local/mongo_as_nmdc_database.ttl \
        local/mongo_as_nmdc_database_cuire_repaired.ttl \
        local/mongo_as_nmdc_database_rdf_safe.yaml \
        local/mongo_as_nmdc_database_validation.log \
        local/mongo_as_unvalidated_nmdc_database.yaml
    poetry run gen-linkml \
        --format yaml \
        --mergeimports \
        --metadata \
        --no-materialize-attributes \
        --no-materialize-patterns \
        --useuris \
        --output nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml src/schema/nmdc.yaml
    INFO:root:Using SchemaView with im=None
    INFO:root:Importing workflow_execution_activity as workflow_execution_activity from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing core as core from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing prov as prov from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing basic_slots as basic_slots from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing external_identifiers as external_identifiers from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing sample_prep as sample_prep from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing portal/sample_id as portal/sample_id from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing linkml:types as /Users/shalkishrivastava/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MtyHo_Ar-py3.11/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/types from source src/schema/nmdc.yaml; base_dir=None
    INFO:root:Importing portal/mixs_inspired as portal/mixs_inspired from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing portal/jgi_metatranscriptomics as portal/jgi_metatranscriptomics from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing portal/jgi_metagenomics as portal/jgi_metagenomics from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing portal/emsl as portal/emsl from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing mixs as mixs from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:root:Importing annotation as annotation from source src/schema/nmdc.yaml; base_dir=src/schema
    INFO:linkml.generators.linkmlgen:Materialized file written to: nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    # probably should have made a list of classes and then looped over a parameterized version of this
    # could also assert that the range is string
    yq -i '(.classes[] | select(.name == "Biosample") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "Biosample") | .slot_usage.part_of.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "Biosample") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "DataObject") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "DataObject") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MagsAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MagsAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetabolomicsAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetabolomicsAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeAnnotationActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeAnnotationActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeAssembly") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeAssembly") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeSequencingActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetagenomeSequencingActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetaproteomicsAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetaproteomicsAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeAnnotationActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeAnnotationActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeAssembly") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "MetatranscriptomeAssembly") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "NomAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "NomAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "OmicsProcessing") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "OmicsProcessing") | .slot_usage.part_of.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "OmicsProcessing") | .slot_usage.has_input.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "OmicsProcessing") | .slot_usage.has_output.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "OmicsProcessing") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "ReadBasedTaxonomyAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "ReadBasedTaxonomyAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "ReadQcAnalysisActivity") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "ReadQcAnalysisActivity") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "Study") | .slot_usage.id.pattern) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    yq -i '(.classes[] | select(.name == "Study") | .slot_usage.id.structured_pattern.syntax) = ".*"' nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    poetry run gen-linkml \
        --format yaml \
        --mergeimports \
        --metadata \
        --no-materialize-attributes \
        --materialize-patterns \
        --useuris \
        --output nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml.temp nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    INFO:root:Using SchemaView with im=None
    INFO:linkml.generators.linkmlgen:Materialized file written to: nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml.temp
    nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml.temp
    mv nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml.temp nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    date  # 276.50 seconds on 2023-08-30 without functional_annotation_agg or metaproteomics_analysis_activity_set
    Tue Oct 31 13:16:19 CDT 2023
    time poetry run pure-export \
        --client-base-url https://api.microbiomedata.org \
        --endpoint-prefix nmdcschema \
        --env-file local/.env \
        --max-docs-per-coll 10000000 \
        --mongo-db-name nmdc \
        --mongo-host localhost \
        --mongo-port 27777 \
        --output-yaml local/mongo_as_unvalidated_nmdc_database.yaml \
        --page-size 10000 \
        --schema-file src/schema/nmdc.yaml \
        --selected-collections biosample_set \
        --selected-collections data_object_set \
        --selected-collections extraction_set \
        --selected-collections field_research_site_set \
        --selected-collections library_preparation_set \
        --selected-collections mags_activity_set \
        --selected-collections metabolomics_analysis_activity_set \
        --selected-collections metagenome_annotation_activity_set \
        --selected-collections metagenome_assembly_set \
        --selected-collections metagenome_sequencing_activity_set  \
        --selected-collections metatranscriptome_activity_set \
        --selected-collections nom_analysis_activity_set \
        --selected-collections omics_processing_set \
        --selected-collections pooling_set \
        --selected-collections processed_sample_set \
        --selected-collections read_based_taxonomy_analysis_activity_set \
        --selected-collections read_qc_analysis_activity_set \
        --selected-collections study_set \
        --skip-collection-check \

selected_collections = ('biosample_set', 'data_object_set', 'extraction_set', 'field_research_site_set', 'library_preparation_set', 'mags_activity_set', 'metabolomics_analysis_activity_set', 'metagenome_annotation_activity_set', 'metagenome_assembly_set', 'metagenome_sequencing_activity_set', 'metatranscriptome_activity_set', 'nom_analysis_activity_set', 'omics_processing_set', 'pooling_set', 'processed_sample_set', 'read_based_taxonomy_analysis_activity_set', 'read_qc_analysis_activity_set', 'study_set')
Attempting to get 0 documents from nmdcschema/biosample_set in pages of 10000.
Retrieved 7594 entries out of 0 from nmdcschema/biosample_set
Attempting to get 0 documents from nmdcschema/data_object_set in pages of 10000.
warning: 524 Server Error: for url: https://api.microbiomedata.org/nmdcschema/data_object_set?max_page_size=10000
warning: FastAPI request to nmdcschema/data_object_set appears to have failed. Trying as a PyMongo query.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/shalkishrivastava/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MtyHo_Ar-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shalkishrivastava/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MtyHo_Ar-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/shalkishrivastava/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MtyHo_Ar-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shalkishrivastava/Library/Caches/pypoetry/virtualenvs/nmdc-schema-MtyHo_Ar-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shalkishrivastava/shalkishrivastava_data/LBL/nmdc/nmdc-schema/nmdc_schema/mongo_dump_api_emph.py", line 354, in cli
    direct_data_all = nmdc_pymongo_client.get_docs_from_pymongo(current_collection, max_docs)
                      ^^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'nmdc_pymongo_client' where it is not associated with a value

real    1m46.110s
user    0m0.619s
sys     0m0.128s
make: *** [local/mongo_as_unvalidated_nmdc_database.yaml] Error 1
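(Side note: the Makefile's own comment above the long run of yq calls suggests looping over a parameterized class list. A minimal bash sketch of that idea follows; the class list is an illustrative subset, not the Makefile's actual implementation.)

    # Sketch of the parameterized loop hinted at by the Makefile comment above;
    # the class list here is an illustrative subset, not the full set the Makefile edits.
    for class in Biosample DataObject Study OmicsProcessing; do
      yq -i "(.classes[] | select(.name == \"$class\") | .slot_usage.id.pattern) = \".*\"" \
        nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
      yq -i "(.classes[] | select(.name == \"$class\") | .slot_usage.id.structured_pattern.syntax) = \".*\"" \
        nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    done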

turbomam commented 10 months ago

Let's work through this in a pair programming session.

turbomam commented 10 months ago

after make squeaky-clean and before make make-rdf, please try make all test

turbomam commented 10 months ago

I think these are the critical lines in the logging:

warning: 524 Server Error: for url: https://api.microbiomedata.org/nmdcschema/data_object_set?max_page_size=10000 warning: FastAPI request to nmdcschema/data_object_set appears to have failed. Trying as a PyMongo query.

In --skip-collection-check mode, pure-export tries to get MongoDB contents through the runtime API, but it can fall back to PyMongo.

  1. I'm surprised that you got a 524 error from the runtime API. Maybe it was something transient; the previous API request, for Biosample data, was successful. So you could try pasting the "server error" data_object_set URL into your web browser, just with a smaller page size (see the sketch after this list).
  2. Since pure-export tried to fall back on a PyMongo connection, it looked for certain environment variables in your local/.env. We had just created that as an empty file before you started make make-rdf, because we assumed you wouldn't need the PyMongo connection. So if the server errors persist for you, we can add the appropriate Mongo connection values to your local/.env.
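For example, a quick way to re-test the failing endpoint and to confirm whether local/.env is still empty (the grep pattern is only illustrative; the exact variable names pure-export expects are defined in nmdc_schema/mongo_dump_api_emph.py and are not shown here):

    # Re-request the endpoint that returned 524, with a much smaller page size.
    curl -s "https://api.microbiomedata.org/nmdcschema/data_object_set?max_page_size=10" | head -c 500

    # Illustrative check that local/.env defines something for the PyMongo fallback;
    # the real variable names come from nmdc_schema/mongo_dump_api_emph.py.
    grep -i 'mongo' local/.env || echo "local/.env has no Mongo settings yet"
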
turbomam commented 10 months ago

I just tried https://api.microbiomedata.org/nmdcschema/data_object_set?max_page_size=10 and got the following:

{
  "detail": [
    {
      "type": "missing",
      "loc": [
        "query",
        "filter"
      ],
      "msg": "Field required",
      "input": null,
      "url": "https://errors.pydantic.dev/2.4/v/missing"
    },
    {
      "type": "missing",
      "loc": [
        "query",
        "page_token"
      ],
      "msg": "Field required",
      "input": null,
      "url": "https://errors.pydantic.dev/2.4/v/missing"
    },
    {
      "type": "missing",
      "loc": [
        "query",
        "projection"
      ],
      "msg": "Field required",
      "input": null,
      "url": "https://errors.pydantic.dev/2.4/v/missing"
    }
  ]
}
turbomam commented 10 months ago

Hi @Shalsh23. I see you assigned this issue to me. What actions would you like me to take?

Shalsh23 commented 10 months ago

after make squeaky-clean and before make make-rdf, please try make all test

This did help. I was able to make progress but stumbled on another error, which seems to be due to the expectation that riot is installed at a specific path. Are there other tools that are expected to be installed at a particular path for this makefile to work?

INFO:root:TRUE: OCCURS SAME: Biosample == TextValue owning: Biosample
INFO:root:TRUE: OCCURS SAME: SubSamplingProcess == QuantityValue owning: SubSamplingProcess
INFO:root:FALSE: OCCURS BEFORE: OntologyClass == OntologyClass owning: ControlledIdentifiedTermValue
INFO:root:FALSE: OCCURS BEFORE: QuantityValue == QuantityValue owning: MaterialSamplingActivity
INFO:root:FALSE: OCCURS BEFORE: MaterialContainer == MaterialContainer owning: MaterialSamplingActivity
INFO:root:FALSE: OCCURS BEFORE: QuantityValue == QuantityValue owning: ReactionActivity
INFO:root:Using SchemaView with im=None

real    3m0.169s
user    2m57.688s
sys 0m1.719s
export _JAVA_OPTIONS=-Djava.io.tmpdir=local
~/apache-jena/bin//riot --validate local/mongo_as_nmdc_database.ttl # < 1 minute
bash: /Users/shalkishrivastava/apache-jena/bin//riot: No such file or directory
make: [local/mongo_as_nmdc_database.ttl] Error 127 (ignored)
date
Fri Nov  3 14:19:47 CDT 2023
time poetry run anyuri-strings-to-iris \
        --input-ttl local/mongo_as_nmdc_database.ttl \
        --jsonld-context-jsons project/jsonld/nmdc.context.jsonld \
        --emsl-biosample-uuid-replacement emsl_biosample_uuid_like \
        --output-ttl local/mongo_as_nmdc_database_cuire_repaired.ttl
Loading prefixes from project/jsonld/nmdc.context.jsonld
Loading local/mongo_as_nmdc_database.ttl
Loaded local/mongo_as_nmdc_database.ttl
Iterating over triples
Serializing to local/mongo_as_nmdc_database_cuire_repaired.ttl
Expanded CURIE literals in RDF graph.

real    1m2.875s
user    1m2.062s
sys 0m0.592s
export _JAVA_OPTIONS=-Djava.io.tmpdir=local
~/apache-jena/bin//riot --validate local/mongo_as_nmdc_database_cuire_repaired.ttl # < 1 minute
bash: /Users/shalkishrivastava/apache-jena/bin//riot: No such file or directory
make: [local/mongo_as_nmdc_database_cuire_repaired.ttl] Error 127 (ignored)
date
Fri Nov  3 14:20:50 CDT 2023
Shalsh23 commented 10 months ago

Hi @Shalsh23. I see you assigned this issue to me. What actions would you like me to take?

I assigned it to you to formally note that you are already helping me with this issue.

turbomam commented 10 months ago

Apache Jena, which includes the riot CLI, can be downloaded from here: https://jena.apache.org/download/index.cgi

The project.Makefile has a JENA_PATH environment variable for the directory that contains all Jena tools.

I have opinionatedly set that to ~/apache-jena/bin/, but you can change it as long as you don't commit your change. I guess we could also put that variable assignment in local/.env.

It may be possible to install the Jena tools system-wide with Homebrew. In that case, JENA_PATH should be set to an empty string.
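
For example, a minimal sketch assuming you unpack a binary Jena release from the download page above (the exact file name depends on the version you choose):

    # Unpack a Jena binary tarball so riot lands under the default JENA_PATH (~/apache-jena/bin/).
    tar -xzf apache-jena-*.tar.gz
    mv apache-jena-*/ ~/apache-jena
    ~/apache-jena/bin/riot --version   # confirm riot resolves where project.Makefile expects it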

turbomam commented 10 months ago

@Shalsh23 I really appreciate that you have stuck with this and have documented your experience. If you have lost your passion for running make make-rdf locally, it can now be run with a manually-triggered GH action in any branch.
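
If you want to trigger it from the command line, something like the following should work with the GitHub CLI (the workflow file name and branch here are placeholders, not the repo's actual names):

    # Placeholder workflow/branch names; substitute the workflow that wraps `make make-rdf`.
    gh workflow run make-rdf.yml --ref your-branch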