mcfrith / last-rna

Questions on Last-train in case of multiple samples #6

hd00ljy closed this issue 5 years ago

hd00ljy commented 5 years ago

Hello!

I am analyzing nanopore human WGS data from multiple samples.

I noticed that the training results are slightly different between samples.

I also tried merging the FASTA files from multiple samples and running last-train on the merged FASTA.

This also gave a slightly different result compared to the results from individual samples.

Q1: Is it better to use the training result from the merged FASTA for every sample, or to use each sample's own training result for that sample?

Q2: If the training result from the merged FASTA is the better option, is there a saturation point beyond which adding more samples no longer changes the training result significantly?

With regards, Jinyoung

mcfrith commented 5 years ago

I suspect these slight differences in the results are not significant. So it doesn't matter: do whatever is most convenient.

If you have datasets that come from different versions of the sequencing tech, or different base-callers, then it's best to run last-train on them separately.
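As an illustration of running last-train separately per dataset, here is a minimal Python sketch that builds one shell command per input file. It assumes the basic `last-train DB QUERY > OUT` invocation form; the database name `humdb`, the file names, and the helper `last_train_commands` are hypothetical and only serve the example.

```python
import shlex

def last_train_commands(db_name, query_files):
    """Build one last-train shell command per dataset (hypothetical helper).

    Assumes the basic form `last-train DB QUERY > OUT`, with the trained
    parameters written to standard output.
    """
    commands = []
    for query in query_files:
        # Name the output after the input file, replacing its extension.
        out = query.rsplit(".", 1)[0] + ".train"
        commands.append(
            f"last-train {shlex.quote(db_name)} {shlex.quote(query)} > {shlex.quote(out)}"
        )
    return commands

# Hypothetical datasets from two different chemistry versions.
for cmd in last_train_commands("humdb", ["r9.fa", "r10.fa"]):
    print(cmd)
```

Each resulting `.train` file would then be used only when aligning reads from the dataset it was trained on.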

If you give last-train more than 1 million bases of query sequence data, it only uses a random 1-million-base sample of it. So that's a kind of saturation. (Probably 1 million is overkill, and much less is OK.)
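To check for yourself whether much less than 1 million bases gives similar results, you could subsample the reads before training. The sketch below picks reads in random order until a target number of bases is reached. It is only an illustration: it is not last-train's own sampling procedure, and the helper names `read_fasta` and `subsample_bases` are hypothetical.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def subsample_bases(records, target_bases, seed=0):
    """Pick whole reads in random order until about target_bases are collected.

    Returns the picked (header, sequence) pairs and their total length.
    The total can overshoot the target by at most one read.
    """
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(records)
    picked, total = [], 0
    for header, seq in records:
        if total >= target_bases:
            break
        picked.append((header, seq))
        total += len(seq)
    return picked, total

# Tiny in-memory example instead of a real FASTA file.
fasta = [">read1", "ACGT", "ACGT", ">read2", "GGGGCCCC", ">read3", "TTTTAAAA"]
sample, n_bases = subsample_bases(read_fasta(fasta), target_bases=10)
print(len(sample), n_bases)
```

Training on samples of a few different sizes and comparing the resulting parameters would show where the results stop changing noticeably.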