Closed lauraht closed 2 years ago
Hi @lauraht,
sirv_transcriptome.fasta
or the sirv_transcriptome_flat.fasta
(one line per transcript)..gz
vs .bz2
). As stated lc19_pcs109_subsample.fastq.bz2
is the original filename as submitted. The ENA wants to relabel it with the assigned run accession ERR3588903
and put it in gz format. I know ONT often uses bz2 for better compression and ENA may want the data in gz format. So all these files are the original sequenced reads before barcode removal and correction (for barcode removal we used pychopper). sirv_transcriptome_flat_non_redundant.fasta
Thank you so much Kristoffer!
This is very helpful!
Of course, happy to help!
Hello!
I was looking into the SIRV data provided by the “Data availability” section in your paper. I have a few questions about this dataset and would appreciate your advice.
1) The SIRV dataset contains the fasta files for 7 SIRV genes' sequences and gtf files for gene annotations. In the paper, you mentioned that you measured the error rate by aligning the reads to the sequences of the 68 true transcripts using minimap2. I was just wondering how you obtained the 68 true transcripts’ sequences? Did you use
gffread
to fetch the transcripts’ sequences from the gtf fileSIRVome_isoforms_Lot00141_C_170612a.gtf
by using the combined 7 genes’ one sequenceSIRVome_isoforms_170612a.fasta
as the reference (like the following)?2) About the data you deposited in the ENA, there are
ERR3588903
from the SRA andERR3588903_1.fastq.gz
. Is theERR3588903
from the SRA the original sequencing reads with errors? And is theERR3588903_1.fastq.gz
the corrected reads by isONcorrect? What is thelc19_pcs109_subsample.fastq.bz2
?3) In the paper, you also mentioned that the 68 SIRV transcripts contain five transcripts that are perfect substrings of other larger transcripts and confound the alignments, so you filtered them out to obtain 63 SIRV transcripts for analysis. I was just wondering what these five substring transcripts are? Do you have this information in this GitHub? What happened to those reads in the original fastq file that belong to these 5 transcripts? Do you just align them to those larger transcripts?
I would greatly appreciate your advice on these questions.
Thank you very much in advance!