Questions about the SIRV data used in the isONcorrect paper

lauraht commented 2 years ago

Hello!

I was looking into the SIRV data provided by the “Data availability” section in your paper. I have a few questions about this dataset and would appreciate your advice.

1) The SIRV dataset contains the fasta files for 7 SIRV genes' sequences and gtf files for gene annotations. In the paper, you mentioned that you measured the error rate by aligning the reads to the sequences of the 68 true transcripts using minimap2. I was just wondering how you obtained the 68 true transcripts’ sequences? Did you use gffread to fetch the transcripts’ sequences from the gtf file SIRVome_isoforms_Lot00141_C_170612a.gtf by using the combined 7 genes’ one sequence SIRVome_isoforms_170612a.fasta as the reference (like the following)?

gffread SIRVome_isoforms_Lot00141_C_170612a.gtf -g SIRVome_isoforms_170612a.fasta -w SIRVome_isoforms_Lot00141_C_170612a.fasta

2) About the data you deposited in the ENA, there are ERR3588903 from the SRA and ERR3588903_1.fastq.gz. Is the ERR3588903 from the SRA the original sequencing reads with errors? And is the ERR3588903_1.fastq.gz the corrected reads by isONcorrect? What is the lc19_pcs109_subsample.fastq.bz2?

3) In the paper, you also mentioned that the 68 SIRV transcripts contain five transcripts that are perfect substrings of other larger transcripts and confound the alignments, so you filtered them out to obtain 63 SIRV transcripts for analysis. I was just wondering what these five substring transcripts are? Do you have this information in this GitHub? What happened to those reads in the original fastq file that belong to these 5 transcripts? Do you just align them to those larger transcripts?

I would greatly appreciate your advice on these questions.

Thank you very much in advance!

ksahlin commented 2 years ago

Hi @lauraht,

I don't remember. Either by gffread or python script. Regardless, you can get the 68 transcripts from here. Either the sirv_transcriptome.fasta or the sirv_transcriptome_flat.fasta (one line per transcript).
None of the files on ENA contain the corrected reads. I'm fairly sure these three columns are the same reads just in different versions and compression (see .gz vs .bz2). As stated lc19_pcs109_subsample.fastq.bz2 is the original filename as submitted. The ENA wants to relabel it with the assigned run accession ERR3588903 and put it in gz format. I know ONT often uses bz2 for better compression and ENA may want the data in gz format. So all these files are the original sequenced reads before barcode removal and correction (for barcode removal we used pychopper).
We filtered these transcripts out before read alignment not to confuse the mapper (minimap2), so there are no reads that need to be "removed". Instead, simply exclude these 5 transcripts before doing the mapping analysis. The 63 unique transcripts are found here as sirv_transcriptome_flat_non_redundant.fasta

lauraht commented 2 years ago

Thank you so much Kristoffer!

This is very helpful!

ksahlin commented 2 years ago

Of course, happy to help!

ksahlin / isONcorrect

Questions about the SIRV data used in the isONcorrect paper #13