Fasterq-dump only produces one .fastq file for paired-end data

DvValk commented 1 year ago

Hi,

I've downloaded the .sra file of run SRR9123299 using prefetch. Based on the metadata, this file should be Illumina paired-end data. However, when I try to split the file using 'fasterq-dump', I only get one output file, named 'SRR9123299.fastq'. I've tried both --split-3 and --split-files. Could it be that the authors only uploaded one FASTQ file when they were supposed to upload two?

Downloaded the file using this command: prefetch -f all SRR9123299 --output-directory my_dir/

Tried to split the file using this command: fasterq-dump --split-3 my_dir/SRR9123299.sra -e 10

and this command: fasterq-dump --split-files my_dir/SRR9123299.sra -e 10

Any help or explanation would be much appreciated!

howtofindme commented 1 year ago

Hi, have you found out why you got only one FASTQ file? I have the same problem.

wraetz commented 1 year ago

You can see what is inside the accession with a command like this: 'vdb-dump SRR9123299 -R1'. It will display all columns of the first row. There are 2 reads per spot. The first read is biological and 100 bases long. The second read is technical and zero bases long. That means either the submitter made a mistake labeling this as paired-end data, or something went wrong processing it. In any case - there is no chance for you to get 2 reads out of this accession right now - because only one read is stored inside the accession.
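The check described above can be sketched in Python: read the `COLUMN: value` lines that `vdb-dump -R1` prints and count how many reads in the first spot have a non-zero length. This is only a minimal sketch. The function names are hypothetical, and the sample text is an assumption reconstructed from the description above (two reads, the first 100 bases long, the second zero bases long), not real output copied from SRR9123299.

```python
def parse_vdb_row(text):
    """Turn 'COLUMN: value' lines (as printed by vdb-dump -R1) into a dict."""
    row = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain ':' stay intact.
        name, sep, value = line.partition(":")
        if sep:
            row[name.strip()] = value.strip()
    return row


def nonempty_read_lengths(row):
    """Return the lengths of all reads in the spot that are longer than zero."""
    raw = row.get("READ_LEN", "")
    lengths = [int(part) for part in raw.split(",") if part.strip()]
    return [n for n in lengths if n > 0]


# ASSUMPTION: illustrative text modelled on the description above,
# not actual vdb-dump output for SRR9123299.
sample = """READ_LEN: 100, 0
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_TECHNICAL
"""

reads = nonempty_read_lengths(parse_vdb_row(sample))
print(f"{len(reads)} non-empty read(s) in the first spot: {reads}")
```

If only one non-empty read is reported, only one read per spot is stored, so a single FASTQ file is the most that can be extracted.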

howtofindme commented 1 year ago

Actually, my .sra file is from SRR11832836. It is paired according to the website https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&page_size=10&acc=SRR11832836&display=metadata.

But I also get only one FASTQ file, and when I run vdb-dump -R1 SRR11832836 in bash I see just one read.

The output:

$ vdb-dump -R1 SRR11832836
ALIGNMENT_COUNT: 4
BASE_COUNT: 25958334766
BIO_BASE_COUNT: 25958334766
CMP_BASE_COUNT: 2505215944
CMP_LINKAGE_GROUP:
CMP_READ:
COLOR_MATRIX: 0, 1, 2, 3, 4, 1, 0, 3, 2, 4, 2, 3, 0, 1, 4, 3, 2, 1, 0, 4, 4, 4, 4, 4, 4
CSREAD: 10030021312113000303330123120000333210301101122230231022222300213121013002321233221232013110021031
CS_KEY: T
CS_NATIVE: false
FIXED_SPOT_LEN: 0
LINKAGE_GROUP: CB:CAGCATAGTAAATGTG-1|UB:CTTAAGGGGC
MAX_SPOT_ID: 264880967
MIN_SPOT_ID: 1
NAME: 1
PLATFORM: SRA_PLATFORM_ILLUMINA
PRIMARY_ALIGNMENT_ID: 1
QUALITY: 32, 32, 32, 32, 32, 36, 36, 36, 36, 14, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 21, 36, 36, 36, 36, 36, 32, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 36, 14, 36, 36, 36, 14, 32, 14, 32, 36, 36, 36, 32, 32, 36, 14, 32, 36, 36, 36, 32, 36, 32, 36, 36, 14, 32, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 36, 32, 27, 14, 36, 32, 14, 14, 36, 36, 14, 36, 32, 32, 14, 27, 27, 14, 14
RD_FILTER: SRA_READ_FILTER_PASS
READ: GGGCCCTGCAGTGCCCCGGCGCCAGCAGGGGGCGCTGGCCACCACTCTAAGCAAGAGAGCCCTGCAGTTGCCCTAGTCGCTCAGCTTGCACCCTGGCA
READ_FILTER: SRA_READ_FILTER_PASS
READ_LEN: 98
READ_SEG: [0, 98]
READ_START: 0
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL|SRA_READ_TYPE_REVERSE
SIGNAL_LEN: 0
SPOT_COUNT: 264880967
SPOT_GROUP: TACAGACT
SPOT_ID: 1
SPOT_LEN: 98
TRIM_LEN: 98
TRIM_START: 0

Since READ_LEN shows only a single value (98), is this the reason why I get only the SRR11832836_1.fastq file? The command I used to generate SRR11832836_1.fastq is:

time fasterq-dump --threads 6 --split-files --include-technical ./SRR11832836/SRR11832836.sra --progress -O ./

All the files in my directory look like this:

So why do I get only one FASTQ file from the .sra file? Please enlighten me.

thanks!

wraetz commented 1 year ago

Yes, that is the reason.

howtofindme commented 1 year ago

So, is there a problem with the .sra file generated by NCBI?

If so, I will let them know.

thanks

wraetz commented 1 year ago

Yes, you should contact NCBI about the accessions in question.
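Before reporting an accession, it may help to collect the first-row read layout for each run in question. The sketch below is one possible way to do that from Python, not an official workflow. It assumes `vdb-dump` is installed and on the PATH, and that its `-C` option accepts a comma-separated column list (READ_LEN and READ_TYPE here); the helper names are hypothetical.

```python
import shutil
import subprocess


def vdb_dump_command(accession):
    """Build the command that prints READ_LEN and READ_TYPE for the first row."""
    # ASSUMPTION: -C takes a comma-separated list of column names.
    return ["vdb-dump", "-R1", "-C", "READ_LEN,READ_TYPE", accession]


def dump_first_row(accession):
    """Run vdb-dump and return its text output, or None if it cannot be run."""
    if shutil.which("vdb-dump") is None:
        return None  # sra-tools is not installed or not on the PATH
    result = subprocess.run(
        vdb_dump_command(accession),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    # The two accessions discussed in this thread.
    for acc in ("SRR9123299", "SRR11832836"):
        output = dump_first_row(acc)
        print(acc)
        print(output if output is not None else "  (vdb-dump unavailable or failed)")
```

The printed READ_LEN and READ_TYPE values can then be quoted in the report to NCBI.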

ncbi / sra-tools

Fasterq-dump only produces one .fastq file for paired-end data #763