Closed leoisl closed 12 months ago
I think we probaby need to increase this default value
float max_relative_gene_coverage { 10 };
to 100 or something, as they might be on plasmids.
I see this flag has also been added to map
. I don't understand how one would auto-update the error rate for sample $n+1$ for map
? Sorry if I am missing something obvious.
Also, my two cents: I feel it might be better to make auto-update the default - i.e. keep the current behaviour - and allow a user to turn it off, thus acknowledging their data isn't clean.
I think we probaby need to increase this default value
float max_relative_gene_coverage { 10 };
to 100 or something, as they might be on plasmids.
Done in https://github.com/rmcolq/pandora/pull/335/commits/08354f57ebb55fad34490d8905bb12ddd040d5b0
I see this flag has also been added to
map
. I don't understand how one would auto-update the error rate for sample n+1 formap
? Sorry if I am missing something obvious.
Indeed it is not useful to update the error rate in map
for now. In the near future we are planning to make map
be able to map several samples instead of one, so it could be useful in this case. However, the kmer coverage model could be updated and is later used when getting the ML path in map
: https://github.com/leoisl/pandora/blob/08354f57ebb55fad34490d8905bb12ddd040d5b0/src/map_main.cpp#L365-L368
Actually I just realised in discover
we receive several samples, but we don't update the error model nor the kmer coverage model at all (i.e. we don't have a call to estimate_parameters()
). Do you think this is some sort of bug? Should we add this to discover
?
Also, my two cents: I feel it might be better to make auto-update the default - i.e. keep the current behaviour - and allow a user to turn it off, thus acknowledging their data isn't clean.
Agreed, done in https://github.com/rmcolq/pandora/pull/335/commits/658a386f1a9ed76800bb69c948d42f7252217acc
Actually I just realised in discover we receive several samples, but we don't update the error model nor the kmer coverage model at all (i.e. we don't have a call to estimate_parameters()). Do you think this is some sort of bug? Should we add this to discover?
Hmmm I don't know. discover
has changed a lot since I was last working on it. I think last I was working on it (before the racon change) it wasn't able to take more than one sample so this wouldn't have been a problem no? But if we're going to allow multiple samples, probably better to mirror what map
and compare
do. Though, we run racon on every PRG anyway right, so not sure it would make a difference? Would only make a difference if we were relying on kmer coverage in order to trigger running racon (which might be wise to do at some stage)
I think last I was working on it (before the racon change) it wasn't able to take more than one sample so this wouldn't have been a problem no?
No, it wouldn't.
But if we're going to allow multiple samples, probably better to mirror what
map
andcompare
do. Though, we run racon on every PRG anyway right, so not sure it would make a difference? Would only make a difference if we were relying on kmer coverage in order to trigger running racon (which might be wise to do at some stage)
Updating the kmer coverage model right now makes no difference on discovery, but updating the error rate does make, as discovery
will now not use a single fixed error rate, but an error rate that is auto updated after processing a sample. I chose to do this to indeed keep consistency with the other 2 commands, implemented in https://github.com/rmcolq/pandora/pull/335/commits/e827462bd2d887a06d29483036ac9db1f0f77d36 . The option --dont-auto-update-params
was also added to discovery, in case you want to fallback to the previous behaviour.
Hey @iqbal-lab , @mbhall88 , @Danderson123 I think this is ready to be merged, please take a look and state your corrections or further reviews, if any.
Happy to merge after the above is discussed.
ok to merge
This PR:
n
, the error rate and the model for kmer coverages would be automatically updated, based on the mapping of the reads from samplen
to the PanRG. These updated values would then be used to map samplen+1
. If samplen
had some issues that caused its error rate to be unusual or its mapping to be unusually low, this could impact negatively samplen+1
mapping. In general, auto-updating the error rate and kmer model would likely produce better results if the input data is well cleaned and has no serious issue. Unfortunately, this might not always be the case, and most users don't check well the input data. This is now an optional parameter, which defaults to not do this auto update:pandora
. This is basically done by looking at the mean coverage of the gene and comparing with the values given for these 3 params: