SRR000001.sra fails to import

MrCreosote commented 3 years ago

Code:

from biokbase.narrative.jobs.appmanager import AppManager
AppManager().run_app(
    "kb_uploadmethods/import_fastq_sra_as_reads_from_staging",
    {
        "import_type": "SRA",
        "fastq_fwd_staging_file_name": "",
        "fastq_rev_staging_file_name": "",
        "sra_staging_file_name": "SRR000001.sra",
        "sequencing_tech": "Illumina",
        "name": "SRR000001.sra_reads",
        "single_genome": 1,
        "interleaved": 0,
        "read_orientation_outward": 0,
        "insert_size_std_dev": None,
        "insert_size_mean": None
    },
    tag="release",
    version="1.0.47",
    cell_id="94e0efbc-7823-40d7-9834-7c08f8698b67",
    run_id="59b3effd-7871-499a-82b5-42909e756df7"
)

Stack trace:

Traceback (most recent call last):
  File "/kb/module/bin/../lib/kb_uploadmethods/kb_uploadmethodsServer.py", line 101, in _call_method
    result = method(ctx, *params)
  File "/kb/module/lib/kb_uploadmethods/kb_uploadmethodsImpl.py", line 914, in import_reads_from_staging
    returnVal = importer.import_reads_from_staging(params)
  File "/kb/module/lib/kb_uploadmethods/Utils/ImportReadsUtil.py", line 75, in import_reads_from_staging
    return self._run_sra_importer(params)
  File "/kb/module/lib/kb_uploadmethods/Utils/ImportReadsUtil.py", line 51, in _run_sra_importer
    return_val = self.sra_importer.import_sra_from_staging(sra_importer_params)
  File "/kb/module/lib/kb_uploadmethods/Utils/ImportSRAUtil.py", line 200, in import_sra_from_staging
    returnVal = self.ru.upload_reads(import_sra_reads_params)
  File "/kb/module/lib/installed_clients/ReadsUtilsClient.py", line 192, in upload_reads
    [params], self._service_ver, context)
  File "/kb/module/lib/installed_clients/baseclient.py", line 253, in run_job
    job_state = self._check_job(mod, job_id)
  File "/kb/module/lib/installed_clients/baseclient.py", line 220, in _check_job
    return self._call(self.url, service + '._check_job', [job_id])
  File "/kb/module/lib/installed_clients/baseclient.py", line 187, in _call
    raise ServerError(**err['error'])
installed_clients.baseclient.ServerError: Server error: -32000. 'Invalid FASTQ file - Path: /kb/module/work/tmp/e015eea6-d58a-4dd8-a992-7dd8550977ac.inter.fastq. Input Files Paths - FWD Path : /kb/module/work/tmp/import_SRA_6393b150-85df-494a-a8e9-3ac736d1d1a4/bc544bfa-dad1-4ee0-8da4-d03cfcfb44d1/SRR000001/1/fastq.fastq, REV Path : /kb/module/work/tmp/import_SRA_6393b150-85df-494a-a8e9-3ac736d1d1a4/bc544bfa-dad1-4ee0-8da4-d03cfcfb44d1/SRR000001/2/fastq.fastq.'
Traceback (most recent call last):
  File "/kb/module/bin/../lib/ReadsUtils/ReadsUtilsServer.py", line 101, in _call_method
    result = method(ctx, *params)
  File "/kb/module/lib/ReadsUtils/ReadsUtilsImpl.py", line 1078, in upload_reads
    raise ValueError(validation_error_message)
ValueError: Invalid FASTQ file - Path: /kb/module/work/tmp/e015eea6-d58a-4dd8-a992-7dd8550977ac.inter.fastq. Input Files Paths - FWD Path : /kb/module/work/tmp/import_SRA_6393b150-85df-494a-a8e9-3ac736d1d1a4/bc544bfa-dad1-4ee0-8da4-d03cfcfb44d1/SRR000001/1/fastq.fastq, REV Path : /kb/module/work/tmp/import_SRA_6393b150-85df-494a-a8e9-3ac736d1d1a4/bc544bfa-dad1-4ee0-8da4-d03cfcfb44d1/SRR000001/2/fastq.fastq.

I believe you can use the SRA toolkit to download SRR000001.sra or I can give you a copy.

All app settings were left at their defaults.

MrCreosote commented 3 years ago

This looks like the issue:

ERROR on Line 19: Raw Sequence is shorter than the min read length: 7 < 10
ERROR on Line 55: Raw Sequence is shorter than the min read length: 4 < 10
ERROR on Line 99: Raw Sequence is shorter than the min read length: 9 < 10
ERROR on Line 307: Raw Sequence is shorter than the min read length: 5 < 10
ERROR on Line 351: Raw Sequence is shorter than the min read length: 8 < 10
ERROR on Line 619: Raw Sequence is shorter than the min read length: 3 < 10
ERROR on Line 1047: Raw Sequence is shorter than the min read length: 1 < 10
ERROR on Line 1203: Raw Sequence is shorter than the min read length: 7 < 10
ERROR on Line 1355: Raw Sequence is shorter than the min read length: 7 < 10
ERROR on Line 1559: Raw Sequence is shorter than the min read length: 4 < 10

I'm not sure what the rationale for the minimum read length is, but it's preventing import from the SRA database, so it may need a rethink, or at least better error messaging.
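
For anyone who wants to check a file locally before uploading, here is a minimal sketch of how the short reads could be listed. It assumes plain 4-line FASTQ records (no wrapped sequence lines). The function name `short_reads` is hypothetical and the default threshold of 10 is taken from the log lines above; this is not the validator's own code, and its line numbering may differ from what the validator reports.

```python
import io


def short_reads(handle, min_len=10):
    """Yield (line_number, read_id, length) for reads shorter than min_len.

    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    Line numbers are 1-based and point at the sequence line.
    """
    read_id = ""
    for line_no, line in enumerate(handle, start=1):
        text = line.rstrip("\r\n")
        if line_no % 4 == 1:
            # Header line: drop the leading '@'.
            read_id = text[1:]
        elif line_no % 4 == 2 and len(text) < min_len:
            # Sequence line that is shorter than the threshold.
            yield line_no, read_id, len(text)


if __name__ == "__main__":
    demo = "@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n@r2\nACGT\n+\nIIII\n"
    for line_no, read_id, length in short_reads(io.StringIO(demo)):
        print(f"line {line_no}: {read_id} has length {length}")
```

To run it on a real file, pass an open text handle, e.g. `short_reads(open(path))`, where `path` points at the FASTQ produced from the SRA file.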

krinsman commented 3 years ago

@MrCreosote How did you get the more specific error messages? Are they from the "internal logs" of a KBase instance running in a Docker container?

I ask because I am getting a similar stack trace in a narrative (https://narrative.kbase.us/narrative/93286) for a FASTQ file imported from IMG/JGI using the beta "JGI search" feature. I would like to know how to check whether it is the same error as the one reported above. Presumably the file could be invalid, but I don't know how to check that myself either.

Traceback (most recent call last):
  File "/kb/module/bin/../lib/kb_uploadmethods/kb_uploadmethodsServer.py", line 101, in _call_method
    result = method(ctx, *params)
  File "/kb/module/lib/kb_uploadmethods/kb_uploadmethodsImpl.py", line 914, in import_reads_from_staging
    returnVal = importer.import_reads_from_staging(params)
  File "/kb/module/lib/kb_uploadmethods/Utils/ImportReadsUtil.py", line 73, in import_reads_from_staging
    return self._run_fastq_importer(params)
  File "/kb/module/lib/kb_uploadmethods/Utils/ImportReadsUtil.py", line 24, in _run_fastq_importer
    return_val = self.uploader_utils.upload_fastq_file(fastq_importer_params)
  File "/kb/module/lib/kb_uploadmethods/Utils/UploaderUtil.py", line 53, in upload_fastq_file
    returnVal = self._upload_file_path(params)
  File "/kb/module/lib/kb_uploadmethods/Utils/UploaderUtil.py", line 296, in _upload_file_path
    result = ru.upload_reads(upload_file_params)
  File "/kb/module/lib/installed_clients/ReadsUtilsClient.py", line 192, in upload_reads
    [params], self._service_ver, context)
  File "/kb/module/lib/installed_clients/baseclient.py", line 253, in run_job
    job_state = self._check_job(mod, job_id)
  File "/kb/module/lib/installed_clients/baseclient.py", line 220, in _check_job
    return self._call(self.url, service + '._check_job', [job_id])
  File "/kb/module/lib/installed_clients/baseclient.py", line 187, in _call
    raise ServerError(**err['error'])
installed_clients.baseclient.ServerError: Server error: -32000. 'Invalid FASTQ file - Path: /kb/module/work/tmp/6003c459-49f8-447b-908e-ab40bf927b73/12859.3.292495.GCTCTGTA-TACAGAGC.filter-METAGENOME.fastq. Input Staging : 12859.3.292495.GCTCTGTA-TACAGAGC.filter-METAGENOME.fastq.gz.'
Traceback (most recent call last):
  File "/kb/module/bin/../lib/ReadsUtils/ReadsUtilsServer.py", line 101, in _call_method
    result = method(ctx, *params)
  File "/kb/module/lib/ReadsUtils/ReadsUtilsImpl.py", line 1078, in upload_reads
    raise ValueError(validation_error_message)
ValueError: Invalid FASTQ file - Path: /kb/module/work/tmp/6003c459-49f8-447b-908e-ab40bf927b73/12859.3.292495.GCTCTGTA-TACAGAGC.filter-METAGENOME.fastq. Input Staging : 12859.3.292495.GCTCTGTA-TACAGAGC.filter-METAGENOME.fastq.gz.

MrCreosote commented 3 years ago

@krinsman They were in the app logs. If you go to the job status tab of the app in the narrative and scroll through the logs, they were a bit above the stack trace for me.

krinsman commented 3 years ago

Oh, I see. Yes, my issue appears to be slightly different. I agree with your recommendation about better error messaging or a rethink.

1624116885.0419786: decompressing (with pigz) /kb/module/work/tmp/0251b7a5-cd8d-4372-a14b-ce6bfb616931/B1A0.3.filtered_raw_reads.fastq.gz to /kb/module/work/tmp/0251b7a5-cd8d-4372-a14b-ce6bfb616931/B1A0.3.filtered_raw_reads.fastq ...
1624120975.4096165: Validating FASTQ file /kb/module/work/tmp/0251b7a5-cd8d-4372-a14b-ce6bfb616931/B1A0.3.filtered_raw_reads.fastq
1624120975.4096515: Checking line count
1624120975.4096663: Removing blank lines and CRLF characters if any
1624134402.9522893: 372764936 lines in file
ERROR on Line 5: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:1174:23366:3035 at Lines 1 and 5
ERROR on Line 13: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:2362:1506:23797 at Lines 9 and 13
ERROR on Line 21: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:2342:8024:22592 at Lines 17 and 21
ERROR on Line 29: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:2342:9109:19930 at Lines 25 and 29
ERROR on Line 37: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:1463:17689:17660 at Lines 33 and 37
ERROR on Line 45: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:2658:22037:28761 at Lines 41 and 45
ERROR on Line 53: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:1405:2483:11710 at Lines 49 and 53
ERROR on Line 61: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:1603:16866:19899 at Lines 57 and 61
ERROR on Line 69: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:1348:13467:23343 at Lines 65 and 69
ERROR on Line 77: Repeated Sequence Identifier: A00178:76:HJ3KHDSXX:3:2625:24451:34632 at Lines 73 and 77
Finished processing /kb/module/work/tmp/0251b7a5-cd8d-4372-a14b-ce6bfb616931/B1A0.3.filtered_raw_reads.fastq with 77 lines containing 20 sequences.
There were a total of 10 errors.
Returning: 1 : FASTQ_INVALID
1624134403.1992147: Validation return code: 1
1624134403.199274: Validation failed

For example, from the error messages it's not clear to me whether there are more than 372 million lines in the file or only 77. It's also not clear why it wasn't possible to just strip out any records with duplicate identifiers, or to give the second copy of each duplicate a new identifier. The whole job ran for 5 hours, and it seems inefficient to keep running that long after identifying the first error on line 5 if the file is going to be rejected anyway.

Admittedly I am not familiar with FASTQ and have only worked with "small" (<100 MB) FASTA files in the past, so perhaps the fix on my end is actually simple.

MrCreosote commented 3 years ago

For this particular error my first guess is that it's an interleaved file and it's being uploaded as non-interleaved. Interleaved files often have the same ID for the forward and reverse reads. Given that, it's probably a good thing the code didn't alter the data.
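
If you want to test that guess before re-uploading, a rough sketch like the following checks whether each consecutive pair of records shares an identifier. It assumes plain 4-line FASTQ records, and the names `read_ids`, `normalize`, and `looks_interleaved` are made up for illustration; this is a heuristic, not what ReadsUtils does internally.

```python
import io
from itertools import islice


def read_ids(handle):
    """Yield the identifier of each record, assuming 4-line FASTQ records."""
    for index, line in enumerate(handle):
        if index % 4 == 0:
            # Drop the leading '@' and any description after whitespace.
            fields = line[1:].split()
            yield fields[0] if fields else ""


def normalize(read_id):
    """Strip a trailing /1 or /2 mate suffix, if present."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def looks_interleaved(handle, max_pairs=1000):
    """Heuristic: True if the first records pair up with matching identifiers."""
    ids = list(islice(read_ids(handle), 2 * max_pairs))
    if len(ids) < 2 or len(ids) % 2:
        return False
    pairs = zip(ids[0::2], ids[1::2])
    return all(normalize(a) == normalize(b) for a, b in pairs)


if __name__ == "__main__":
    demo = (
        "@A:1 1:N:0\nACGT\n+\nIIII\n"
        "@A:1 2:N:0\nTGCA\n+\nIIII\n"
    )
    print(looks_interleaved(io.StringIO(demo)))
```

Only the first `max_pairs` pairs are inspected, so this stays cheap even on a very large file.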

I'm pretty sure the 372M lines is accurate. The lines without a timestamp come from a third-party FASTQ validation module, and you'll note that it stops checking for errors after 10 occurrences. The tenth falls on line 77, which is suspicious.
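
To illustrate why the report would end at line 77, here is a hypothetical sketch of a duplicate-identifier check that gives up after a fixed number of errors. It is not the third-party module's code; the function name `first_duplicate_ids` and the `max_errors` parameter are assumptions, and it again assumes 4-line records.

```python
def first_duplicate_ids(lines, max_errors=10):
    """Return up to max_errors tuples of (line_number, read_id, first_seen_line).

    Assumption: 4-line FASTQ records, so headers sit on lines 1, 5, 9, ...
    Scanning stops as soon as max_errors duplicates have been found.
    """
    seen = {}
    errors = []
    for line_no, line in enumerate(lines, start=1):
        if line_no % 4 != 1:
            continue  # only header lines carry identifiers
        fields = line[1:].split()
        read_id = fields[0] if fields else ""
        if read_id in seen:
            errors.append((line_no, read_id, seen[read_id]))
            if len(errors) >= max_errors:
                break
        else:
            seen[read_id] = line_no
    return errors


if __name__ == "__main__":
    # Synthetic interleaved input: each pair shares one identifier.
    lines = []
    for i in range(50):
        for mate in (1, 2):
            lines += [f"@read{i} {mate}:N:0", "ACGT", "+", "IIII"]
    for line_no, read_id, first in first_duplicate_ids(lines):
        print(f"line {line_no}: {read_id} repeated (first seen at line {first})")
```

On interleaved input where every pair shares an identifier, the duplicates land on header lines 5, 13, 21, and so on, so a cap of 10 errors is reached at line 77, the same pattern as in the log above.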

5 hours does seem long - I don't have an answer for that off the top of my head. You may wish to file a ticket with the KBase helpdesk.

kbaseapps / kb_uploadmethods

SRR000001.sra fails to import #325