thelovelab / tximport

Transcript quantification import for modular pipelines

Dealing with multispecies using tximport #53

Closed FlorianRocher closed 1 year ago

FlorianRocher commented 1 year ago

Hi,

I have a methodological question about how to deal with multispecies mapping and tximport.

I have RNA-seq samples that are a blend of 3 different species interacting with each other, so I performed the mapping with Salmon using a hybrid reference composed of the transcriptomes of those 3 species. Now I wonder how to correctly use tximport on the Salmon outputs in order to generate one count matrix for each of those 3 species.

Because normalization according to library size will be done for each species independently, I was wondering whether I should separate the Salmon outputs by species before running tximport. From the tximport documentation, I should use countsFromAbundance="no". This setting uses the NumReads column of the Salmon output, which appears to rely partly on the relative abundance of each transcript. Is NumReads partly based on the library size? If it is not, I guess that splitting the results per species before or after tximport won't have any effect, but if NumReads depends on the library size, then I should run tximport on the complete Salmon output.

Thanks for your help,

Florian Rocher

mikelove commented 1 year ago

I would just pretend that the extra transcripts are from the same species, and import one long count table with tximport.

The question becomes, what are you going to do next? Are you aiming to compare alleles?

NumReads is the number of reads Salmon assigned to that particular transcript. The sum of NumReads will be the number of reads Salmon could map (it doesn't add or remove reads, it just apportions them fractionally among transcripts).
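As a rough illustration of why splitting by species before or after import should not change the raw counts, here is a minimal Python sketch. It assumes rows holding a subset of the columns in Salmon's quant.sf (Name, EffectiveLength, NumReads) and a hypothetical naming convention in which each transcript name carries a species prefix such as `plant|`. Neither that prefix nor the helper `split_by_species` comes from Salmon or tximport, and the numbers are made up.

```python
from collections import defaultdict

# Toy rows: (Name, EffectiveLength, NumReads). Values are invented.
# The "species|transcript" naming is an assumption for this sketch only.
rows = [
    ("plant|tx1", 1500.0, 900.0),
    ("plant|tx2", 800.0, 100.5),
    ("pathA|tx1", 1200.0, 49.5),
    ("pathB|tx1", 600.0, 50.0),
]


def split_by_species(rows):
    """Group rows by the (hypothetical) species prefix of the transcript name."""
    per_species = defaultdict(list)
    for name, eff_len, num_reads in rows:
        species = name.split("|", 1)[0]
        per_species[species].append((name, eff_len, num_reads))
    return dict(per_species)


per_species = split_by_species(rows)

# Total mapped reads across the hybrid reference.
total_reads = sum(r[2] for r in rows)

# Per-species totals: each transcript keeps exactly the NumReads it had
# in the full table, so these totals add back up to total_reads.
species_totals = {sp: sum(r[2] for r in rs) for sp, rs in per_species.items()}
```

Because subsetting only selects rows and never rescales NumReads, the per-species tables can be cut out of the one long table after import.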

FlorianRocher commented 1 year ago

Hi,

Thanks for your fast answer! Next I will do DE analysis per species (I am dealing with a plant facing two different pathogen species). So, just to be sure I understood correctly what NumReads represents: NumReads is a count that was estimated from unique and multi-mapping reads and takes into account the size of the transcript, but does not take into account the library size. Am I correct?

mikelove commented 1 year ago

NumReads does not take transcript size into account; that information is in the effective length of the transcript. NumReads / effective length = something proportional to TPM.

The sum of NumReads over the transcripts is the mapped read count. The sum of the TPM is 1e6.
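The relationship between NumReads, effective length, and TPM described above can be sketched as follows. This is a toy calculation with made-up numbers, not Salmon's actual code, and `tpm_from_counts` is a hypothetical helper name.

```python
def tpm_from_counts(num_reads, eff_lengths):
    """Scale NumReads / effective length so that the values sum to 1e6."""
    rates = [n / l for n, l in zip(num_reads, eff_lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Invented values for three transcripts; the first two are imagined to
# belong to one species and the third to another.
num_reads = [900.0, 100.0, 50.0]
eff_lengths = [1500.0, 500.0, 250.0]

# TPM over the full hybrid reference.
tpm = tpm_from_counts(num_reads, eff_lengths)

# Recomputing TPM on a species subset rescales it to sum to 1e6 again,
# whereas the NumReads values of those transcripts stay as they were.
plant_reads = num_reads[:2]
plant_tpm = tpm_from_counts(plant_reads, eff_lengths[:2])
```

In this sketch the TPM values depend on which transcripts are included in the normalization, while NumReads is simply carried along unchanged, which is why the counts are the safer quantity to split by species.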

FlorianRocher commented 1 year ago

Hi,

Alright, I get it. Thank you for your time and all those insights!

Have a nice day

Florian