Open eecavanna opened 2 months ago
@turbomam @eecavanna this hasn't been touched in 2 weeks. Removing from sprint, adding backlog label.
I would like to get some buy-in from @mbthornton-lbl because I designed this buggy feature in support of automating the validation of data that can be retrieved with methods he wrote. I'm not using it, and removing that whole workflow would really slim down the nmdc-schema's Makefile.
Update: I intended for this to help with bulk get-study-related-records operations
get-study-related-records = "src.scripts.nmdc_database_tools:cli" # todo recheck
We could also remove the targets that interact with fuseki as part of this.
Some paths forward I see:
.PHONY: pre-build
pre-build: local/gold-study-ids.yaml create-nmdc-tdb2-from-app
## getting a report of GOLD study identifiers, which might have been used as Study ids in legacy (pre-Napa) data
local/gold-study-ids.json:
	curl -X 'GET' \
		--output $@ \
		'https://api-napa.microbiomedata.org/nmdcschema/study_set?max_page_size=999&projection=id%2Cgold_study_identifiers' \
		-H 'accept: application/json'
local/gold-study-ids.yaml: local/gold-study-ids.json
	yq -p json -o yaml $< | cat > $@
# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_IDS := $(shell yq '.resources.[].id' local/gold-study-ids.yaml | awk '{printf "%s ", $$0} END {print ""}')
# can't ever be used without generating local/gold-study-ids.yaml first
print-discovered-study-ids:
	@echo $(STUDY_IDS)
# Replace colons with hyphens in study IDs
# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_YAML_FILES := $(addsuffix .yaml,$(addprefix local/study-files/,$(subst :,-,$(STUDY_IDS))))
# can't ever be used without generating local/gold-study-ids.yaml first
create-study-yaml-files-from-study-ids-list: $(STUDY_YAML_FILES)
# can't ever be used without generating local/gold-study-ids.yaml first
print-intended-yaml-files: local/gold-study-ids.yaml
	@echo $(STUDY_YAML_FILES)
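For readers less used to Make's text functions, here is a rough Python sketch of what the STUDY_YAML_FILES expression computes for a single id. The inverse function is only a guess at what a helper like get-study-id-from-filename might do; its real implementation is not shown in this thread, and both function names below are hypothetical.

```python
from pathlib import PurePosixPath

STUDY_FILES_DIR = "local/study-files"  # directory used by the Makefile targets


def study_id_to_yaml_path(study_id: str, directory: str = STUDY_FILES_DIR) -> str:
    """Mirror $(addsuffix .yaml,$(addprefix local/study-files/,$(subst :,-,ID)))."""
    return f"{directory}/{study_id.replace(':', '-')}.yaml"


def yaml_path_to_study_id(path: str) -> str:
    """Hypothetical inverse: assumes the CURIE prefix itself contains no hyphen."""
    stem = PurePosixPath(path).stem  # file name without directory or .yaml suffix
    prefix, _, local_part = stem.partition("-")  # split on the first hyphen only
    return f"{prefix}:{local_part}"


if __name__ == "__main__":
    path = study_id_to_yaml_path("nmdc:sty-11-8fb6t785")
    print(path)
    print(yaml_path_to_study_id(path))
```

Only the first hyphen is turned back into a colon, because the local part of the id (sty-11-8fb6t785) contains hyphens of its own.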
PS: API calls with arbitrary, high max_page_size are risky
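One way around a large max_page_size would be to follow the next_page_token that the API returns alongside resources. The sketch below is only an illustration: the page_token query parameter name is an assumption (not confirmed anywhere in this thread), and iter_resources and make_http_fetcher are hypothetical helpers, not part of nmdc_database_tools.py.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def iter_resources(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every resource, requesting pages until no next_page_token is returned."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.get("resources", [])
        token = page.get("next_page_token")
        if not token:
            break


def make_http_fetcher(collection_url: str, page_size: int = 100) -> Callable[[Optional[str]], dict]:
    """Build a fetch_page callable for a collection endpoint.

    ASSUMPTION: the token is sent back as a 'page_token' query parameter.
    """
    def fetch_page(token: Optional[str]) -> dict:
        params = {"max_page_size": page_size}
        if token:
            params["page_token"] = token  # assumed parameter name
        with urlopen(f"{collection_url}?{urlencode(params)}") as response:
            return json.load(response)
    return fetch_page


if __name__ == "__main__":
    # Offline demonstration with canned pages instead of real API calls.
    pages = {
        None: {"resources": [{"id": "a"}, {"id": "b"}], "next_page_token": "t1"},
        "t1": {"resources": [{"id": "c"}]},
    }
    print([r["id"] for r in iter_resources(lambda token: pages[token])])
```

Keeping the HTTP call behind a callable means the paging loop can be tested without touching the network.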
wc -l local/gold-study-ids.yaml
63 local/gold-study-ids.yaml
head local/gold-study-ids.yaml
resources:
  - id: nmdc:sty-11-8fb6t785
    gold_study_identifiers:
      - gold:Gs0114675
  - id: nmdc:sty-11-33fbta56
    gold_study_identifiers:
      - gold:Gs0110138
  - id: nmdc:sty-11-aygzgv51
    gold_study_identifiers:
      - gold:Gs0114663
make --dry-run create-study-yaml-files-from-study-ids-list
mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
echo $study_file_name ; \
study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
echo $study_id ; \
date ; \
time poetry run get-study-related-records \
--api-base-url https://api-berkeley.microbiomedata.org \
extract-study \
--study-id $study_id \
--output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml
sed -i.bak 's/gold:/GOLD:/' local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml # kludge modify data to match (old!) schema
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.bak
poetry run linkml-validate --schema nmdc_schema/nmdc_materialized_patterns.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml > local/study-files/nmdc-sty-11-8fb6t785.yaml.validation.log.txt
time poetry run migration-recursion \
--schema-path nmdc_schema/nmdc_materialized_patterns.yaml \
--input-path local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml \
--output-path local/study-files/nmdc-sty-11-8fb6t785.yaml # kludge masks ids that contain whitespace
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml.bak
mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-33fbta56.yaml` ; \
echo $study_file_name ; \
study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
echo $study_id ; \
date ; \
time poetry run get-study-related-records \
--api-base-url https://api-berkeley.microbiomedata.org \
extract-study \
--study-id $study_id \
--output-file local/study-files/nmdc-sty-11-33fbta56.yaml.tmp.yaml
etc.
study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
echo $study_file_name ; \
study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
echo $study_id ; \
date ; \
time poetry run get-study-related-records \
--api-base-url https://api-berkeley.microbiomedata.org \
extract-study \
--study-id $study_id \
--output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml
local/study-files/nmdc-sty-11-8fb6t785.yaml
nmdc:sty-11-8fb6t785
Thu Sep 26 11:10:57 AM EDT 2024
STUDY-ID: nmdc:sty-11-8fb6t785
SCHEMA-VERSION: 11.0.0rc22
Got study nmdc:sty-11-8fb6t785 from the NMDC database.
Got 0 biosamples part_of nmdc:sty-11-8fb6t785.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 261, in extract_study
    raise e
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 253, in extract_study
    omics_processing_records = api_client.get_omics_processing_records_part_of_study(study_id)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 75, in get_omics_processing_records_part_of_study
    response.raise_for_status()
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api-berkeley.microbiomedata.org/nmdcschema/omics_processing_set?filter=%7B%22part_of%22%3A+%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=1000

real 0m3.892s
user 0m1.407s
sys 0m0.124s
nmdc:sty-11-8fb6t785 appears to be a real study: https://api-berkeley.microbiomedata.org/nmdcschema/ids/nmdc%3Asty-11-8fb6t785
but the command above is trying to find OmicsProcessings that are part of nmdc:sty-11-8fb6t785, and OmicsProcessing has been replaced with DataGeneration subclasses as of berkeley-schema-fy24
Also maybe there really are no DataGeneration subclass instances that are part of that Study?
In fact, maybe DataGeneration subclass instances can't be part_of anything any more?
https://api-berkeley.microbiomedata.org/nmdcschema/data_generation_set?max_page_size=1
{
  "resources": [
    {
      "id": "nmdc:omprc-11-0003fm52",
      "name": "1000S_WLUP_FTMS_SPE_BTM_1_run2_Fir_22Apr22_300SA_p01_149_1_3506",
      "description": "High resolution MS spectra only",
      "has_input": [
        "nmdc:bsm-11-jht0ty76"
      ],
      "has_output": [
        "nmdc:dobj-11-cp4p5602"
      ],
      "processing_institution": "EMSL",
      "type": "nmdc:MassSpectrometry",
      "analyte_category": "nom",
      "associated_studies": [
        "nmdc:sty-11-28tm5d36"
      ],
      "instrument_used": [
        "nmdc:inst-14-mwrrj632"
      ]
    }
  ],
  "next_page_token": "nmdc:sys0qphf9j29"
}
see also
https://microbiomedata.github.io/berkeley-schema-fy24/MassSpectrometry/
so if that's still hard-coded into https://github.com/microbiomedata/berkeley-schema-fy24/blob/cd6acbee87b627b439d068b6bfeb8cb002f05d99/src/scripts/nmdc_database_tools.py#L64-L83
then maybe that script should be considered unmaintained?
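To make the mismatch concrete, here is a small sketch that rebuilds the failing request URL from the traceback and, next to it, a candidate replacement that targets data_generation_set and filters on associated_studies (the slot visible in the sample record above). Whether the API actually accepts that filter has not been verified here, and build_filter_url is a hypothetical helper, not something from nmdc_database_tools.py.

```python
import json
from urllib.parse import urlencode

API_BASE_URL = "https://api-berkeley.microbiomedata.org"


def build_filter_url(base_url: str, collection: str, filter_doc: dict, max_page_size: int) -> str:
    """Build an /nmdcschema/<collection> URL with a JSON-encoded filter parameter."""
    query = urlencode({"filter": json.dumps(filter_doc), "max_page_size": max_page_size})
    return f"{base_url}/nmdcschema/{collection}?{query}"


study_id = "nmdc:sty-11-8fb6t785"

# The request that currently fails with 400 Bad Request (see the traceback).
old_url = build_filter_url(API_BASE_URL, "omics_processing_set", {"part_of": study_id}, 1000)

# Candidate replacement. ASSUMPTION: filtering on associated_studies is supported.
new_url = build_filter_url(API_BASE_URL, "data_generation_set", {"associated_studies": study_id}, 1000)

if __name__ == "__main__":
    print(old_url)
    print(new_url)
```

If the second URL does return records, the fix may be as small as swapping the collection and slot names in the hard-coded client method.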
In my local clone of the berkeley-schema-fy24 repo, when I run $ make squeaky-clean all, the console output begins with an error message (shown in a screenshot).