ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal

CADSR-VS latest submission failed to process #242

Open jvendetti opened 2 years ago

jvendetti commented 2 years ago

BioPortal shows CADSR-VS submission 164 with status "Uploaded", and nothing further.

Parsing log file at /srv/ncbo/repository/CADSR-VS/164 shows processing started, then halted:

# Logfile created on 2022-01-27 21:34:38 -0800 by logger.rb/v1.4.3
I, [2022-01-27T21:34:38.705790 #22160]  INFO -- : ["Starting to process http://data.bioontology.org/ontologies/CADSR-VS/submissions/164"]
I, [2022-01-27T21:34:38.731532 #22160]  INFO -- : ["Starting to process CADSR-VS/submissions/164"]

ncbo_cron console session shows latest submission as invalid:

> sub = LinkedData::Models::OntologySubmission.find(RDF::URI.new('http://data.bioontology.org/ontologies/CADSR-VS/submissions/164')).first
> sub.bring_remaining
> sub.valid?
=> false
> sub.errors
=> {:submissionId=>{:integer=>"Attribute `submissionId` value `164` must be a `Integer`"}}
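
The error above suggests that `submissionId` was stored as something other than an Integer (for example, the string "164"). A minimal sketch of that kind of type check in plain Ruby follows. `validate_integer` is a hypothetical helper written for illustration; it is not the validator that ontologies_linked_data actually uses.

```ruby
# Hypothetical helper (not part of ontologies_linked_data): returns an
# errors hash shaped like the one in the console output when the value
# is not an Integer, and an empty hash otherwise.
def validate_integer(attr, value)
  return {} if value.is_a?(Integer)

  { attr => { integer: "Attribute `#{attr}` value `#{value}` must be a `Integer`" } }
end

# A string that merely looks like a number fails the check; an Integer passes.
puts validate_integer(:submissionId, "164").inspect
puts validate_integer(:submissionId, 164).inspect
```

Because the message interpolates the value without quotes, a string "164" and an integer 164 look identical in the error text, which may be why the message reads oddly.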

jvendetti commented 2 years ago

Deleted corrupt / invalid submission (ID 164). Added new submission using identical properties from previous submissions. New submission is currently second in the processing queue:

> hgetall parseQueue
1) "sub:http://data.bioontology.org/ontologies/RS/submissions/209"
2) "{\"process_rdf\":true,\"index_search\":true,\"index_properties\":true,\"run_metrics\":true,\"process_annotator\":true,\"diff\":true}"
3) "sub:http://data.bioontology.org/ontologies/CADSR-VS/submissions/164"
4) "{\"process_rdf\":true,\"index_search\":true,\"index_properties\":true,\"run_metrics\":true,\"process_annotator\":true,\"diff\":true}"
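
For readers unfamiliar with the flat field/value layout that `hgetall` prints, here is a small sketch that pairs each queued submission with its JSON-encoded processing options. The data is copied from the output above; `QUEUE_DUMP` and `queue_entries` are names made up for this illustration and are not part of ncbo_cron.

```ruby
require 'json'

# Processing options exactly as they appear in the queue output above.
OPTIONS = '{"process_rdf":true,"index_search":true,"index_properties":true,' \
          '"run_metrics":true,"process_annotator":true,"diff":true}'

# Flat field/value list in the order HGETALL printed it.
QUEUE_DUMP = [
  'sub:http://data.bioontology.org/ontologies/RS/submissions/209', OPTIONS,
  'sub:http://data.bioontology.org/ontologies/CADSR-VS/submissions/164', OPTIONS
].freeze

# Hypothetical helper: group the flat list into pairs and decode the JSON.
def queue_entries(flat)
  flat.each_slice(2).map do |field, options|
    { submission: field.delete_prefix('sub:'), actions: JSON.parse(options) }
  end
end

queue_entries(QUEUE_DUMP).each do |entry|
  puts "#{entry[:submission]} -> #{entry[:actions].keys.join(', ')}"
end
```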

jvendetti commented 2 years ago

Submission 164 successfully processed

jvendetti commented 1 year ago

Yesterday @martinjoconnor reported that CEDAR was getting 404s from BioPortal trying to access CADSR-VS value sets. During troubleshooting I discovered that submission 198 was corrupt. I deleted the corrupt submission and removed associated files from the production server, and the 404s were resolved.

However, there's a new version of the value set out there, and the nightly pull process failed to fully ingest it. The latest submission shows as "Uploaded" with no other statuses. The latest_submission REST endpoint (https://data.bioontology.org/ontologies/CADSR-VS/latest_submission) returns an empty object:

{ }

In an ncbo_cron console session, the newly created submission object shows the following error:

$ pry(main)> sub = LinkedData::Models::OntologySubmission.find(RDF::URI.new('http://data.bioontology.org/ontologies/CADSR-VS/submissions/198')).first
$ pry(main)> sub.bring_remaining
$ pry(main)> sub.valid?
=> false
$ pry(main)> sub.errors
=> {:submissionId=>{:integer=>"Attribute `submissionId` value `198` must be a `Integer`"}}

The parsing.log file in /srv/ncbo/repository/CADSR-VS/198 shows:

# Logfile created on 2022-08-01 18:10:04 -0700 by logger.rb/v1.5.1
I, [2022-08-01T18:10:04.244062 #32041]  INFO -- : ["Starting to process http://data.bioontology.org/ontologies/CADSR-VS/submissions/198"]
I, [2022-08-01T18:10:04.277208 #32041]  INFO -- : ["Starting to process CADSR-VS/submissions/198"]

... and then nothing. In other words, the processing begins and then stops for unknown reasons without any error messages. This is the same sort of behavior that was originally reported on April 5th, and has clearly happened another couple of times recently.
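
One way to spot this pattern across repository folders is to flag any parsing log whose only entries are the "Starting to process" lines. The sketch below is an assumption about how such a check could look; `stalled_log?` is a hypothetical helper, not something that exists in ncbo_cron.

```ruby
# Hypothetical check: true when the log contains entries, and every one of
# them is a "Starting to process" line (processing began, then went silent).
def stalled_log?(text)
  entries = text.lines.map(&:strip).reject { |l| l.empty? || l.start_with?('#') }
  !entries.empty? && entries.all? { |l| l.include?('Starting to process') }
end

# Sample content copied from the parsing log shown above.
SAMPLE_LOG = <<~LOG
  # Logfile created on 2022-08-01 18:10:04 -0700 by logger.rb/v1.5.1
  I, [2022-08-01T18:10:04.244062 #32041]  INFO -- : ["Starting to process http://data.bioontology.org/ontologies/CADSR-VS/submissions/198"]
  I, [2022-08-01T18:10:04.277208 #32041]  INFO -- : ["Starting to process CADSR-VS/submissions/198"]
LOG

puts stalled_log?(SAMPLE_LOG)
```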

I also looked at the scheduler-pull.log file for the nightly pull process. There are no errors in that log file to indicate that something went wrong with the pull and creation of a new submission:

I, [2022-08-01T18:01:37.830854 #31664]  INFO -- : Checking download for CADSR-VS
I, [2022-08-01T18:01:37.830940 #31664]  INFO -- : Location: https://shared.metadatacenter.org/cadsr/ontologies/CADSR-VS.owl
I, [2022-08-01T18:01:39.479255 #31664]  INFO -- : New file found for CADSR-VS
old: e0740031a00f086bffd300f79ee52f80
new: bb70a84d39dd7e57db918ff1fc0c1612

... (shortened for brevity)

I, [2022-08-01T18:02:25.071223 #31664]  INFO -- : OWLAPI Java command: parsing finished successfully.
I, [2022-08-01T18:02:25.071868 #31664]  INFO -- : Output size 183222963 in `/srv/ncbo/repository/CADSR-VS/198/owlapi.xrdf`
I, [2022-08-01T18:02:25.184741 #31664]  INFO -- : OntologyPull created a new submission (198) for ontology CADSR-VS
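
The `old:` and `new:` values in the log are 32 hex digits, which look like MD5 digests of the previous and newly downloaded files. Assuming that is what they are, the "New file found" decision can be sketched as a simple digest comparison. `new_file?` is a hypothetical helper, not the actual pull code.

```ruby
require 'digest'

# Hypothetical helper: a download counts as "new" when its MD5 digest
# differs from the digest recorded for the previous submission.
def new_file?(old_md5, new_content)
  Digest::MD5.hexdigest(new_content) != old_md5
end

# The two digests from the log differ, which is why a new submission was created.
old_md5 = 'e0740031a00f086bffd300f79ee52f80'
new_md5 = 'bb70a84d39dd7e57db918ff1fc0c1612'
puts old_md5 != new_md5
```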

jvendetti commented 1 year ago

Despite deleting the corrupt submission, all subsequent attempts to pull the new version fail with the same characteristics. The only piece of debugging information that I didn't include in the last comment was an error message I saw when trying to repair the corrupt submission (rather than ultimately deleting it).

If you retrieve the corrupt submission, set submissionId to an integer, and save it, there's still a "file doesn't exist" error message attached to the submission object.

Console session output:

$ sub = LinkedData::Models::OntologySubmission.find(RDF::URI.new('http://data.bioontology.org/ontologies/CADSR-VS/submissions/198')).first
$ sub.bring_remaining
$ sub.valid?
=> false

$ sub.errors
=> {:submissionId=>{:integer=>"Attribute `submissionId` value `198` must be a `Integer`"}}

$ sub.submissionId = 198
=> 198

$ sub.save
$ sub.valid?
=> true

$ sub.errors
=> {:pullLocation=>["File at https://shared.metadatacenter.org/cadsr/ontologies/CADSR-VS.owl does not exist"]}

It may be worth looking at the remote_file_exists? method in ontologies_linked_data in a debugger to see if anything goes awry with fetching this particular file.
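
To reproduce the check outside of BioPortal, something like the following could be run from a console on the production server. This is only a sketch under assumptions: it issues a plain HEAD request with Ruby's standard Net::HTTP and treats only 2xx responses as "exists". It is not the implementation of remote_file_exists?, and `head_status` and `status_means_exists?` are hypothetical names.

```ruby
require 'net/http'
require 'uri'

# Assumption: only 2xx status codes count as "the file exists".
# Redirects (3xx) are not followed in this sketch.
def status_means_exists?(code)
  (200..299).cover?(code.to_i)
end

# Issue a HEAD request and return the status code as a string, so that a
# non-2xx answer (redirect, 403, timeout-related error) can be inspected.
def head_status(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: 10, read_timeout: 10) do |http|
    http.head(uri.request_uri).code
  end
end
```

Calling `head_status('https://shared.metadatacenter.org/cadsr/ontologies/CADSR-VS.owl')` and passing the result to `status_means_exists?` would show whether the server answers HEAD requests for this file differently from what the pull process expects.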

In order to get the latest version into BioPortal, I've (again) deleted the corrupt submission that resulted from the nightly pull process on Aug 3rd. I created a new submission that uses the "Upload local file" option instead of loading from a URL. The latest version was successfully processed and is now available in BioPortal.