A question about Iso-Seq reads

KaparaNewbie commented 1 year ago

Hey there, Rajewsky lab!

I downloaded samples SRR17321895 to SRR17321901 (7 total) from the SRA. I wanted to preprocess them as recommended in the isoseq3 guide, as instructed in the methods ("Processing of PacBio SMRT data"). However, the files are in fastq format rather than BAM, and when I try to run:

lima --isoseq SRR17321896.fastq primers.fasta SRR17321896.fl.fastq

I get the following warning:

| 20221211 10:15:20.228 | WARN | Attention! You are trying to demultiplex non CCS data. CLR demultiplexing is only supported with BAM/XML input! Will proceed to demultiplex each sequence individually, not grouped by ZMW!

Does it mean that the fastq files uploaded to the SRA are already processed and ready to be aligned? I started checking this formatting because (strangely?) FastQC found small amounts of illumina_small_rna_3'_adapter in the SRR17321896 sample.

Furthermore, I have an additional question, if I may. At first ("Full-length mRNA library preparation and sequencing"), you write:

To produce a comprehensive annotation of the O. vulgaris transcriptome, we therefore combined FLAM-seq with Iso-Seq, ...

But later on ("Isoform reconstruction from FLNC reads"), you write:

For Iso-Seq, FLNC reads have been mapped to the Octopus sinensis genome

Could you kindly explain this difference? Throughout most of the paper, you refer to O. vulgaris, but in the methods (and in the gene_expression workflow here), you refer to O. sinensis. If this question results from careless reading, I apologize in advance.

I appreciate any help you can provide.

zolotarovgl commented 1 year ago

Dear @KaparaNewbie,

The reads deposited in the SRA are FLNC (full-length non-chimeric) reads in Iso-Seq terminology. This means they are ready to be mapped to the genome; you don't need to preprocess the sequences yourself. In this repository, you can also find isoforms reconstructed with the TAMA collapse tool, using the Octopus sinensis genome as a reference. We decided to use the sinensis rather than the vulgaris genome because the former is much more complete and its annotation is already quite extensive. The two are closely related members of the same species complex and were only recently recognized as separate species.
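For readers who want to sanity-check that a downloaded fastq file really looks like FLNC output, one option is to count how many reads still carry a primer sequence near either end; since lima clips the cDNA primers, that fraction should be close to zero for FLNC reads. The following is only a rough sketch under several assumptions: the function names are made up for illustration, matching is exact (no mismatches or indels), the 100-base end window is an arbitrary choice, and the fastq is assumed to use four lines per record. It is not part of the authors' pipeline.

```python
import io

# Translation table used to complement DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_fastq(handle):
    """Yield read sequences from a FASTQ handle (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def primer_hit_fraction(reads, primers, window=100):
    """Fraction of reads with an exact primer match (either strand)
    within `window` bases of the read start or end."""
    patterns = set()
    for seq in primers:
        seq = seq.upper()
        patterns.add(seq)
        patterns.add(reverse_complement(seq))
    total = hits = 0
    for read in reads:
        read = read.upper()
        # "|" keeps a match from spanning the junction of the two ends.
        ends = read[:window] + "|" + read[-window:]
        total += 1
        if any(p in ends for p in patterns):
            hits += 1
    return hits / total if total else 0.0


# Hypothetical usage with the file names mentioned in this thread:
# with open("primers.fasta") as pf, open("SRR17321896.fastq") as rf:
#     primers = [seq for _, seq in read_fasta(pf)]
#     print(primer_hit_fraction(read_fastq(rf), primers))
```

Because the match is exact, a low fraction is suggestive rather than conclusive; a tolerant aligner would be needed to rule out partially degraded primers.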

Caveat: this means that the reconstructed isoforms contain sinensis genomic sequence! While in my opinion this is not a problem at all, we cannot guarantee that both species use exactly the same isoforms. There may be individual cases where a gap in the sinensis annotation makes it impossible to reconstruct the isoform at that locus. Let me know if this answer was useful and whether I can help you with downstream analyses / interpretation.

Grygoriy

KaparaNewbie commented 1 year ago

Dear @zolotarovgl, Thank you so much for taking the time to respond in great detail, both here and via email. BTW, sorry for the duplicated messages... I think you answered everything, thanks!

rajewsky-lab / octopus_microRNAs
