ncbi / sra-tools

SRA Tools

Incomplete Referential Integrity of downloaded sra files #186

Open frankysuperior opened 5 years ago

frankysuperior commented 5 years ago

Hello, I used ascp to download SRR1792677. After the download completed successfully, I checked the integrity of the SRA data with vdb-validate. The output (shown below) reports that the Referential Integrity check is incomplete. Does this affect downstream analysis such as mapping and quantification? If so, how can I fix it?

$vdb-validate SRR1792677 >> 1.txt
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Validating '/home/yczhao/ncbi/public/sra/SRR1792677.sra'...
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Database 'SRR1792677.sra' metadata: md5 ok
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Table 'PRIMARY_ALIGNMENT' metadata: md5 ok
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Column 'GLOBAL_REF_START': checksums ok
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Column 'HAS_MISMATCH': checksums ok
2019-03-08T16:14:48 vdb-validate.2.8.2 info: Column 'HAS_REF_OFFSET': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'MISMATCH': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'REF_LEN': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'REF_OFFSET': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'REF_OFFSET_TYPE': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'REF_ORIENTATION': checksums ok
2019-03-08T16:14:49 vdb-validate.2.8.2 info: Column 'SEQ_READ_ID': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'SEQ_SPOT_ID': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Table 'REFERENCE' metadata: md5 ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CGRAPH_HIGH': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CGRAPH_INDELS': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CGRAPH_LOW': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CGRAPH_MISMATCHES': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CIRCULAR': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CS_KEY': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'OVERLAP_REF_LEN': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'OVERLAP_REF_POS': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'PRIMARY_ALIGNMENT_IDS': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'SEQ_ID': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'SEQ_LEN': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'SEQ_START': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Table 'SEQUENCE' metadata: md5 ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'ALIGNMENT_COUNT': checksums ok
2019-03-08T16:14:50 vdb-validate.2.8.2 info: Column 'CMP_ALTREAD': checksums ok
2019-03-08T16:14:52 vdb-validate.2.8.2 info: Column 'CMP_READ': checksums ok
2019-03-08T16:14:55 vdb-validate.2.8.2 info: Column 'PRIMARY_ALIGNMENT_ID': checksums ok
2019-03-08T16:15:34 vdb-validate.2.8.2 info: Column 'QUALITY': checksums ok
2019-03-08T16:15:46 vdb-validate.2.8.2 info: Column 'READ_TYPE': checksums ok
2019-03-08T16:16:39 vdb-validate.2.8.2 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 66.8% complete
2019-03-08T16:16:45 vdb-validate.2.8.2 info: Database '/home/yczhao/ncbi/public/sra/SRR1792677.sra': SEQUENCE.PRIMARY_ALIGNMENT_ID <-> PRIMARY_ALIGNMENT.SEQ_SPOT_ID referential integrity ok
2019-03-08T16:17:02 vdb-validate.2.8.2 info: Referential Integrity: REF_ID <-> PRIMARY_ALIGNMENT_IDS 66.8% complete
2019-03-08T16:17:02 vdb-validate.2.8.2 info: Database '/home/yczhao/ncbi/public/sra/SRR1792677.sra': REFERENCE.PRIMARY_ALIGNMENT_IDS <-> PRIMARY_ALIGNMENT.REF_ID referential integrity ok
2019-03-08T16:17:02 vdb-validate.2.8.2 info: Database 'SRR1792677.sra' is consistent

frankysuperior commented 5 years ago

I downloaded the SRA data again, but the same issue persists.

kwrodarmer commented 5 years ago

Please be patient - we are working on it. Very sorry for the delay.

frankysuperior commented 5 years ago

> Please be patient - we are working on it. Very sorry for the delay.

Thanks so much for the reply. In fact, I downloaded all the SRA data under SRP053296, and most of the runs show the same problem.

frankysuperior commented 5 years ago

Moreover, when I tried to download the SRA data under SRP093349, some of the runs could not be retrieved with either ascp or prefetch. Below is one example:

$ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l200m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR501/SRR5019885/SRR5019885.sra ./

Session Stop (Error: Server aborted session: No such file or directory)

$prefetch SRR5019885

2019-03-09T05:42:15 prefetch.2.8.2: 1) Downloading 'SRR5019885'...
2019-03-09T05:42:15 prefetch.2.8.2: Downloading via https...
2019-03-09T05:47:13 prefetch.2.8.2 int: libs/kns/http-client.c:1480:KClientHttpStreamTimedRead: incomplete while within network system module - x
2019-03-09T05:47:13 prefetch.2.8.2: 1) failed to download SRR5019885

kwrodarmer commented 5 years ago

The problem has been reported to our data curators, and we're waiting on a response. Sorry for the delay.

frankysuperior commented 5 years ago

> The problem has been reported to our data curators, and we're waiting on a response. Sorry for the delay.

Thanks for the help. I am looking forward to the solution.

kwrodarmer commented 5 years ago

@wraetz - could you confirm the curators' evaluation and then post your results? @frankysuperior - thank you very much for your patience.

wraetz commented 5 years ago

I can confirm that the tool 'vdb-validate' produced confusing output. One line states that the referential integrity check reached only 66.8% rather than 100%. However, the next line states that this same referential integrity check is ok, and the last line states that "Database 'SRR1792677.sra' is consistent".

After discussing the issue with the author of the tool, I can say that the validation finished correctly. You can use the downloaded run(s); they are consistent. The verification tool 'vdb-validate' has a confusing behavior: it stops reporting progress before reaching 100%, even though the check actually ran to the end. This will be fixed in an upcoming release of the toolkit.
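For anyone checking many runs, here is a minimal Python sketch of how the output could be read according to the explanation above: rely on the "referential integrity ok" and "is consistent" lines, and treat the percentage as informational only. The function name `summarize_validation` is hypothetical, and the patterns are taken only from the vdb-validate 2.8.2 output quoted in this thread, so other toolkit versions may word these lines differently.

```python
import re

# Patterns copied from the vdb-validate 2.8.2 log quoted in this thread.
# Assumption: other toolkit versions may phrase these messages differently.
CONSISTENT = re.compile(r"info: Database '[^']+' is consistent")
REF_OK = re.compile(r"referential integrity ok")
PROGRESS = re.compile(r"Referential Integrity: .*? (\d+(?:\.\d+)?)% complete")


def summarize_validation(log_text):
    """Summarize a captured vdb-validate log (hypothetical helper).

    The verdict lines are what matter; the last progress percentage
    can be below 100 even when the check ran to completion.
    """
    return {
        "consistent": bool(CONSISTENT.search(log_text)),
        "ref_integrity_ok": len(REF_OK.findall(log_text)),
        "progress_percent": [float(p) for p in PROGRESS.findall(log_text)],
    }


# Three lines taken from the log posted at the top of this issue.
sample = """\
2019-03-08T16:16:39 vdb-validate.2.8.2 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 66.8% complete
2019-03-08T16:16:45 vdb-validate.2.8.2 info: Database '/home/yczhao/ncbi/public/sra/SRR1792677.sra': SEQUENCE.PRIMARY_ALIGNMENT_ID <-> PRIMARY_ALIGNMENT.SEQ_SPOT_ID referential integrity ok
2019-03-08T16:17:02 vdb-validate.2.8.2 info: Database 'SRR1792677.sra' is consistent
"""

result = summarize_validation(sample)
print(result)
```

To use it on a real run, pass in the text of the file where the vdb-validate output was captured (for example the 1.txt from the original report). A run would count as usable when `consistent` is true, regardless of the values in `progress_percent`.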