mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.

GNU General Public License v3.0

21 stars 8 forks

setting random seed? #16

Closed ax-ekk closed 2 years ago

ax-ekk commented 2 years ago

Hi!

Thanks very much for this great tool!

I could not find any paper describing this tool, so I have to ask. I am running this on a (very) small test dataset and I have noticed that the output varies slightly between runs (even though the input and index are the same). It's one read or so that is sometimes assigned to one gene/TE and sometimes to another. I assume this means that the random seed is not set by default? Is there a way I can set it? It's purely for testing purposes.

Many thanks again, Elin

olivertam commented 2 years ago

Hi Elin,

Thank you for your interest in the software. The tool is not yet published, but much of the methodology is based on TEtranscripts. I have contacted the person who is actively developing the tool, and they indicated that we don't use a random seed for the quantification. While we take a closer look at the code, could you confirm that you are using the latest version of TElocal? If you like, please feel free to send us your test dataset, and we are happy to see if we can replicate the issue on our end.

Thank you again.

ax-ekk commented 2 years ago

Hi Oliver,

Sorry, my mistake! The inconsistency was not caused by TElocal but by an upstream program that has some (undocumented) random seed selection.

Many thanks, also for the reference. Best Elin

ax-ekk commented 2 years ago

Hi again,

for completeness:

It was the order of the reads in the input that caused TElocal to return those different results.

I used bbduk to trim my reads, and although bbduk returns the exact same set of reads, the order in which it outputs them varies from run to run. If I later run TElocal on those files, I (occasionally) get slightly different results.
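To check that two trimmed files really contain the same reads and differ only in order, an order-independent comparison can help. The following is a minimal sketch, assuming plain four-line FASTQ records (optionally gzipped); the function names are hypothetical and not part of bbduk or TElocal.

```python
import gzip
import hashlib
from collections import Counter


def read_fastq_records(path):
    """Yield (header, sequence, quality) tuples from a four-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file (or an unexpected blank line)
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header, seq, qual


def fastq_fingerprint(path):
    """Count record digests so the result does not depend on read order."""
    counts = Counter()
    for header, seq, qual in read_fastq_records(path):
        digest = hashlib.sha1(f"{header}\n{seq}\n{qual}".encode()).hexdigest()
        counts[digest] += 1
    return counts


def same_reads_any_order(path_a, path_b):
    """Return True if both files hold the same multiset of records."""
    return fastq_fingerprint(path_a) == fastq_fingerprint(path_b)
```

If `same_reads_any_order` returns True for two bbduk outputs, any downstream difference can be attributed to read order rather than to read content.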

In my small test data, the TElocal results differ between bbduk runs in about 1 in 10 runs.

It's not a big difference: on the one full dataset where I compared a random bbduk order with the name-sorted order, ~35 genes/TEs differed, by at most 2 reads each (total number of counted reads ~4.5 million). It surely will not affect the conclusions of the data analysis, but it can be confusing in a testing/reproducibility setting.
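A comparison like this can be scripted. The sketch below assumes each count table is a tab-delimited file with one header line, a feature name in the first column and a count in the second; that layout is an assumption here, and the function names are hypothetical.

```python
import csv


def load_counts(path):
    """Read a two-column, tab-delimited count table into a dict."""
    counts = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            counts[row[0]] = float(row[1])
    return counts


def diff_counts(path_a, path_b):
    """Return {feature: (count_a, count_b)} for features whose counts differ."""
    counts_a = load_counts(path_a)
    counts_b = load_counts(path_b)
    diffs = {}
    for feature in sorted(set(counts_a) | set(counts_b)):
        count_a = counts_a.get(feature, 0.0)
        count_b = counts_b.get(feature, 0.0)
        if count_a != count_b:
            diffs[feature] = (count_a, count_b)
    return diffs
```

The number of differing features is `len(diff_counts(a, b))`, and the largest per-feature difference is `max(abs(x - y) for x, y in diff_counts(a, b).values())` when the result is non-empty.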

I guess one solution is to always use name-sorted BAM files as input.
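One way to do this is with `samtools sort -n`, which sorts alignments by read name. The following is a minimal sketch that wraps that command from Python; it assumes `samtools` is installed and on the PATH, and the helper names are hypothetical. Whether name sorting removes all run-to-run variation in TElocal is not confirmed in this thread.

```python
import subprocess


def name_sort_command(input_bam, output_bam, extra_threads=0):
    """Build the samtools argument list for sorting a BAM file by read name."""
    return [
        "samtools", "sort",
        "-n",                      # sort by read name instead of coordinate
        "-@", str(extra_threads),  # number of additional threads
        "-o", str(output_bam),     # output file
        str(input_bam),
    ]


def name_sort_bam(input_bam, output_bam, extra_threads=0):
    """Run samtools; raises CalledProcessError if the command fails."""
    subprocess.run(
        name_sort_command(input_bam, output_bam, extra_threads),
        check=True,
    )
```

Calling `name_sort_bam("aligned.bam", "aligned.namesorted.bam")` (placeholder file names) before quantification gives every run the same input order.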

Many thanks Elin

olivertam commented 2 years ago

Hi Elin,

Thank you for the detailed description. We will take a look on our end to see if the order of reads in the BAM should have affected quantification.

All the best.