microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

`berkeley-schema-fy24`: `make squeaky-clean all` output begins with error message about missing file `local/gold-study-ids.yaml` #2177

Open eecavanna opened 2 months ago

eecavanna commented 2 months ago

In my local clone of the `berkeley-schema-fy24` repo, when I run `make squeaky-clean all`, the console output begins with an error message:

$ make squeaky-clean all
Error: open local/gold-study-ids.yaml: no such file or directory
rm -rf project
rm -rf tmp
# ...

ssarrafan commented 1 month ago

@turbomam @eecavanna this hasn't been touched in 2 weeks. Removing from sprint, adding backlog label.

turbomam commented 1 month ago

I would like to get some buy-in from @mbthornton-lbl because I designed this buggy feature in support of automating the validation of data that can be retrieved with methods he wrote. I'm not using it, and removing that whole workflow would really slim down the nmdc-schema's Makefile.

Update: I intended for this to help with bulk get-study-related-records operations

get-study-related-records = "src.scripts.nmdc_database_tools:cli" # todo recheck

We could also remove the targets that interact with fuseki as part of this.

eecavanna commented 1 month ago

Some paths forward I see:

  1. If the file has moved, update the reference and close issue
  2. If the file is gone, remove the reference and close issue
  3. If the file is obsolete, remove the reference and close issue

turbomam commented 1 month ago

.PHONY: pre-build
pre-build: local/gold-study-ids.yaml create-nmdc-tdb2-from-app

## getting a report of GOLD study identifiers, which might have been used as Study ids in legacy (pre-Napa) data
local/gold-study-ids.json:
    curl -X 'GET' \
        --output $@ \
        'https://api-napa.microbiomedata.org/nmdcschema/study_set?max_page_size=999&projection=id%2Cgold_study_identifiers' \
        -H 'accept: application/json'

local/gold-study-ids.yaml: local/gold-study-ids.json
    yq -p json -o yaml $< | cat > $@

# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_IDS := $(shell yq '.resources.[].id' local/gold-study-ids.yaml  | awk '{printf "%s ", $$0} END {print ""}')

# can't ever be used without generating local/gold-study-ids.yaml first
print-discovered-study-ids:
    @echo $(STUDY_IDS)

# Replace colons with hyphens in study IDs
# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_YAML_FILES := $(addsuffix .yaml,$(addprefix local/study-files/,$(subst :,-,$(STUDY_IDS))))

# can't ever be used without generating local/gold-study-ids.yaml first
create-study-yaml-files-from-study-ids-list: $(STUDY_YAML_FILES)

# can't ever be used without generating local/gold-study-ids.yaml first
print-intended-yaml-files: local/gold-study-ids.yaml
    @echo $(STUDY_YAML_FILES)
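The excerpt above suggests why even `make squeaky-clean` prints the error: `STUDY_IDS` is assigned with `:=` and `$(shell ...)`, which make expands while reading the Makefile, before any target runs. A minimal Python sketch of the same ID-to-filename mapping, with a guard for the missing file, might look like this. The function names are hypothetical, and it reads the JSON file rather than the YAML one so that only the standard library is needed.

```python
import json
from pathlib import Path


def load_study_ids(ids_file: Path) -> list[str]:
    """Return the study ids listed in ids_file, or [] if the file is absent."""
    if not ids_file.exists():
        # Unlike the unguarded $(shell yq ...) call, a missing file is not an error here.
        return []
    data = json.loads(ids_file.read_text())
    return [resource["id"] for resource in data.get("resources", [])]


def study_yaml_path(study_id: str, out_dir: str = "local/study-files") -> str:
    """Mirror STUDY_YAML_FILES: replace colons with hyphens, add directory and .yaml."""
    return f"{out_dir}/{study_id.replace(':', '-')}.yaml"


if __name__ == "__main__":
    for study_id in load_study_ids(Path("local/gold-study-ids.json")):
        print(study_yaml_path(study_id))
```

With no `local/gold-study-ids.json` present, the script prints nothing instead of failing, which is the behavior one would want from the clean targets.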

turbomam commented 1 month ago

PS: API calls with arbitrary, high max_page_size are risky
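One alternative to a single large page is to follow the `next_page_token` field that the API returns (it is visible in a response quoted later in this thread). The sketch below keeps the HTTP call behind a hypothetical `fetch_page` callable, because how the token is sent back to the server is an assumption I have not verified here; `fake_fetch` is a stand-in for illustration only.

```python
from typing import Callable, Iterator, Optional

# fetch_page takes the previous page's token (None for the first page)
# and returns one decoded JSON page. Both names are hypothetical.
FetchPage = Callable[[Optional[str]], dict]


def iter_resources(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every resource, requesting pages until no next_page_token is returned."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("resources", [])
        token = page.get("next_page_token")
        if not token:
            break


def fake_fetch(token: Optional[str]) -> dict:
    """Stand-in for a real HTTP call, returning two canned pages."""
    pages = {
        None: {"resources": [{"id": "nmdc:sty-11-8fb6t785"}], "next_page_token": "t1"},
        "t1": {"resources": [{"id": "nmdc:sty-11-33fbta56"}]},
    }
    return pages[token]


ids = [resource["id"] for resource in iter_resources(fake_fetch)]
print(ids)  # ['nmdc:sty-11-8fb6t785', 'nmdc:sty-11-33fbta56']
```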

turbomam commented 1 month ago

$ wc -l local/gold-study-ids.yaml
63 local/gold-study-ids.yaml

$ head local/gold-study-ids.yaml
resources:
  - id: nmdc:sty-11-8fb6t785
    gold_study_identifiers:
      - gold:Gs0114675
  - id: nmdc:sty-11-33fbta56
    gold_study_identifiers:
      - gold:Gs0110138
  - id: nmdc:sty-11-aygzgv51
    gold_study_identifiers:
      - gold:Gs0114663

turbomam commented 1 month ago

$ make --dry-run create-study-yaml-files-from-study-ids-list

mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml
sed -i.bak 's/gold:/GOLD:/' local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml # kludge modify data to match (old!) schema
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.bak
poetry run linkml-validate --schema nmdc_schema/nmdc_materialized_patterns.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml > local/study-files/nmdc-sty-11-8fb6t785.yaml.validation.log.txt
time poetry run migration-recursion \
        --schema-path nmdc_schema/nmdc_materialized_patterns.yaml \
        --input-path local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml \
        --output-path local/study-files/nmdc-sty-11-8fb6t785.yaml # kludge masks ids that contain whitespace
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml.bak

mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-33fbta56.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-33fbta56.yaml.tmp.yaml

etc.
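The dry run relies on `get-study-id-from-filename` to turn a file name back into a study id. Its implementation is not shown in this thread, so the following is only a guess at the inverse of the colon-to-hyphen substitution, with a hypothetical function name.

```python
from pathlib import PurePosixPath


def study_id_from_filename(file_name: str) -> str:
    """Turn 'local/study-files/nmdc-sty-11-8fb6t785.yaml' into 'nmdc:sty-11-8fb6t785'."""
    stem = PurePosixPath(file_name).name.removesuffix(".yaml")
    # Only the first hyphen stood for a colon; later hyphens belong to the id itself.
    prefix, _, local_id = stem.partition("-")
    return f"{prefix}:{local_id}"


print(study_id_from_filename("local/study-files/nmdc-sty-11-8fb6t785.yaml"))
```

This assumes every id contains exactly one colon, directly after the `nmdc` prefix, as in the ids listed above.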

turbomam commented 1 month ago

study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml

local/study-files/nmdc-sty-11-8fb6t785.yaml
nmdc:sty-11-8fb6t785
Thu Sep 26 11:10:57 AM EDT 2024
STUDY-ID: nmdc:sty-11-8fb6t785
SCHEMA-VERSION: 11.0.0rc22
Got study nmdc:sty-11-8fb6t785 from the NMDC database.
Got 0 biosamples part_of nmdc:sty-11-8fb6t785.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 261, in extract_study
    raise e
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 253, in extract_study
    omics_processing_records = api_client.get_omics_processing_records_part_of_study(study_id)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 75, in get_omics_processing_records_part_of_study
    response.raise_for_status()
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api-berkeley.microbiomedata.org/nmdcschema/omics_processing_set?filter=%7B%22part_of%22%3A+%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=1000

real 0m3.892s
user 0m1.407s
sys 0m0.124s
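To see what the failing request actually asked for, the percent-encoded URL from the `HTTPError` can be decoded with the standard library. This is a small sketch; `describe_request` is a hypothetical helper, not part of `nmdc_database_tools.py`.

```python
import json
from urllib.parse import parse_qs, urlsplit

# URL copied from the HTTPError message above.
FAILED_URL = (
    "https://api-berkeley.microbiomedata.org/nmdcschema/omics_processing_set"
    "?filter=%7B%22part_of%22%3A+%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=1000"
)


def describe_request(url: str) -> dict:
    """Split an nmdcschema collection URL into collection name, filter and page size."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # also undoes the percent- and plus-encoding
    return {
        "collection": parts.path.rsplit("/", 1)[-1],
        "filter": json.loads(query["filter"][0]),
        "max_page_size": int(query["max_page_size"][0]),
    }


print(describe_request(FAILED_URL))
```

The decoded request targets the `omics_processing_set` collection with the filter `{"part_of": "nmdc:sty-11-8fb6t785"}`.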

turbomam commented 1 month ago

nmdc:sty-11-8fb6t785 appears to be a real study: https://api-berkeley.microbiomedata.org/nmdcschema/ids/nmdc%3Asty-11-8fb6t785

but the command above is trying to find OmicsProcessings that are part of nmdc:sty-11-8fb6t785, and OmicsProcessing has been replaced with DataGeneration subclasses as of berkeley-schema-fy24

Also maybe there really are no DataGeneration subclass instances that are part of that Study?

https://api-berkeley.microbiomedata.org/nmdcschema/data_generation_set?filter=%7B%22part_of%22%3A%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=20

In fact, maybe DataGeneration subclass instances can't be part_of anything any more?

https://api-berkeley.microbiomedata.org/nmdcschema/data_generation_set?max_page_size=1

{
  "resources": [
    {
      "id": "nmdc:omprc-11-0003fm52",
      "name": "1000S_WLUP_FTMS_SPE_BTM_1_run2_Fir_22Apr22_300SA_p01_149_1_3506",
      "description": "High resolution MS spectra only",
      "has_input": [
        "nmdc:bsm-11-jht0ty76"
      ],
      "has_output": [
        "nmdc:dobj-11-cp4p5602"
      ],
      "processing_institution": "EMSL",
      "type": "nmdc:MassSpectrometry",
      "analyte_category": "nom",
      "associated_studies": [
        "nmdc:sty-11-28tm5d36"
      ],
      "instrument_used": [
        "nmdc:inst-14-mwrrj632"
      ]
    }
  ],
  "next_page_token": "nmdc:sys0qphf9j29"
}

see also

https://microbiomedata.github.io/berkeley-schema-fy24/MassSpectrometry/

so if that's still hard-coded into https://github.com/microbiomedata/berkeley-schema-fy24/blob/cd6acbee87b627b439d068b6bfeb8cb002f05d99/src/scripts/nmdc_database_tools.py#L64-L83

then maybe that script should be considered unmaintained?
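If someone did want to keep the script, the sample record above hints at what the query would need to change to: `data_generation_set` instead of `omics_processing_set`, and `associated_studies` instead of `part_of`. Below is a sketch of building such a URL. I have not run it against the API, so whether the endpoint accepts this filter on `associated_studies` is an assumption, and `data_generation_url` is a hypothetical name.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api-berkeley.microbiomedata.org"


def data_generation_url(study_id: str, max_page_size: int = 20) -> str:
    """Build a data_generation_set query that filters on associated_studies."""
    query = urlencode({
        # Compact separators keep the encoded filter free of '+' characters.
        "filter": json.dumps({"associated_studies": study_id}, separators=(",", ":")),
        "max_page_size": max_page_size,
    })
    return f"{API_BASE}/nmdcschema/data_generation_set?{query}"


print(data_generation_url("nmdc:sty-11-28tm5d36"))
```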

turbomam commented 1 month ago

closed by