rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

How are files named in the output of parallel-fastq-dump? #42

Status: Closed (yls2g13 closed this issue 2 years ago)

yls2g13 commented 2 years ago

Hi dbGaP,

I'm writing to ask about this specific project by NYGC: '3 CANCER CELL LINES ON 2 SEQUENCERS' (dbGaP accession number phs001839.v1.p1): https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001839.v1.p1

I've reached out to NYGC to ask what the 72x, 74x, or 75x means, and what the 48, 49, 45, or 43 means, in these file names:

HCC1143-N_72x_48_1.fastq.gz
HCC1143-N_72x_48_2.fastq.gz
HCC1143-N_72x_49_1.fastq.gz
HCC1143-N_72x_49_2.fastq.gz
HCC1143-N_74x_45_1.fastq.gz
HCC1143-N_74x_45_2.fastq.gz
HCC1143-N_75x_43_1.fastq.gz
HCC1143-N_75x_43_2.fastq.gz

To get these files, I had to install Aspera Connect, then use prefetch from the SRA Toolkit, and then run parallel-fastq-dump to extract and gzip the FASTQ files.

NYGC said that dbGaP named these files: NYGC uploaded BAM files, and dbGaP converted them to FASTQ files to host on the dbGaP FTP.

dbGaP is now saying that parallel-fastq-dump named these files.

Can you help me understand how files are named in the output of parallel-fastq-dump please?

Appreciate your prompt reply and thank you!

Best, Nicole

rvalieris commented 2 years ago

hello,

the filenames are created by the regular fastq-dump (parallel-fastq-dump uses fastq-dump directly).
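As a rough illustration of the point above, the sketch below predicts the output names for one run. It assumes the commonly observed fastq-dump behaviour for paired-end data: with `--split-files` the input name gets `_1` and `_2` appended, and with `--gzip` the files end in `.fastq.gz`. The function `expected_fastq_names` is a hypothetical helper written for this example; it is not part of fastq-dump or parallel-fastq-dump.

```python
def expected_fastq_names(run_name, split_files=True, gzip=True):
    """Guess the FASTQ file names written for one run.

    Assumption: paired-end data, and fastq-dump derives the output
    name from the input name (accession or file base name).
    """
    suffix = ".fastq.gz" if gzip else ".fastq"
    if split_files:
        # One file per mate: <name>_1 and <name>_2
        return [f"{run_name}_1{suffix}", f"{run_name}_2{suffix}"]
    # Without splitting, all reads go into a single file
    return [f"{run_name}{suffix}"]


if __name__ == "__main__":
    for name in expected_fastq_names("HCC1143-N_72x_48"):
        print(name)
```

Under these assumptions, only the trailing `_1`/`_2` and the extension come from the tool; the rest of the name (here `HCC1143-N_72x_48`) is whatever name the input carried.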

looking at the run selector for this project: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=phs001839

to me, it looks like these are library replicates, and the 75x is the run coverage: if you click on the SRR links, you can see that the run coverage number matches. I don't know what the second number is, but my guess is that it was added only to keep the files for each replicate distinct.

SRR10005214: 75x
SRR10109399: 72x
SRR10109400: 72x
SRR10109401: 74x
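To work with the file names under this interpretation, a small parser can split each name into its parts. This is only a sketch: the pattern `<sample>_<coverage>x_<number>_<read>.fastq.gz` is inferred from the eight file names in the question, the reading of the first number as coverage is the guess made in this thread, and the field names (`sample`, `coverage`, `tag`, `read`) are made up for this example.

```python
import re

# Assumed layout: <sample>_<coverage>x_<number>_<read>.fastq.gz
FILENAME_RE = re.compile(
    r"^(?P<sample>.+)_(?P<coverage>\d+)x_(?P<tag>\d+)_(?P<read>[12])\.fastq\.gz$"
)


def parse_name(filename):
    """Split a file name into its parts, or return None if it does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return {
        "sample": m.group("sample"),
        "coverage": int(m.group("coverage")),  # guessed meaning: run coverage
        "tag": m.group("tag"),                 # meaning unknown, kept as text
        "read": int(m.group("read")),          # 1 or 2: mate within the pair
    }


if __name__ == "__main__":
    print(parse_name("HCC1143-N_72x_48_1.fastq.gz"))
```

Grouping the parsed records by `(coverage, tag)` pairs up the two mate files of each replicate, without needing to know what the second number means.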

the only way to be sure is to either find a paper from the original authors using this data or contact the authors directly.