hubmapconsortium / ingest-validation-tests

0 stars 0 forks source link

Add FASTQ validation to ingest validation tests #19

Closed Daghis closed 2 years ago

Daghis commented 2 years ago

Inform future developers that working with this repository as standalone will not work and provide a reference to the parent repository.

Daghis commented 2 years ago

FYI, I think I've gotten the last few issues resolved since it's running seemingly successfully (if slowly) on hivevm193.

Daghis commented 2 years ago

If the issue is merely that the error messages being generated should be print()'d as well as collected in the List[str] that's returned, I'd prefer to address that by removing the verbose argument and just having all messages be collected and printed.

How does that sound?

Daghis commented 2 years ago

I was conflating self.errors with local errors variables. I believe I've corrected the logic so that the right value should be returned versus logged (when verbose is enabled... I left the argument in).

jswelling commented 2 years ago

I've run a test with the fix to errors vs self.errors, and that did indeed fix the problem from the earlier test. I'm running another test to check on a hint of a second problem- I think an error string from verify_fastq_filename should produce a log message, but should not cause the dataset to be invalid. Ideally the regex that chooses what to scan could be less broad, and explicitly select .fastq, .fastq.gz, (and maybe .FASTQ.gz; I have seen those around), so that files like fastq.good.gz are never included in the scan in the first place. I'm thinking I need to do a PR against the test that scans generic .gz files, so that those scanned by this test aren't scanned again. One good way to do that mod would be to import the regex used by this test and exclude anything matching that regex.

Daghis commented 2 years ago

Oh, that's right! I'd come across that regex in my searching and using, or at least replicating, that regex.

I'll take care of that tomorrow.

Daghis commented 2 years ago

I've cleaned up that part by using fastq_utils' code for finding the files in the first place, making that the canonical code for that. I suspect that my additional dependency to bring in fastq_utils might cause an issue when running as a plugin. I'll need some guidance for how to correct that if that should come to pass.

Daghis commented 2 years ago

Comment from @jswelling in Slack:

here's another delta for the FASTQ validator: when it doesn't find any FASTQ files to check, it needs to not set any errors. That way non-sequence datasets just skip this validation. I'll add that to the PR next time I touch it but I thought I'd let you know now. This was found by accident during David Betancur's experiments with running ingestions on TEST.

The most recent commit logs the no-files-found message, but doesn't return it as an error.