iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Adds parameters dealing with auto-updating error rate and kmer model, and for gene filtering based on coverage #335

Closed leoisl closed 12 months ago

leoisl commented 1 year ago

This PR:

  1. Adds a parameter to deal with auto-updating error rate and kmer model. Before this PR, after processing sample n, the error rate and the model for kmer coverages would be automatically updated, based on the mapping of the reads from sample n to the PanRG. These updated values would then be used to map sample n+1. If sample n had some issues that caused its error rate to be unusual or its mapping to be unusually low, this could impact negatively sample n+1 mapping. In general, auto-updating the error rate and kmer model would likely produce better results if the input data is well cleaned and has no serious issue. Unfortunately, this might not always be the case, and most users don't check well the input data. This is now an optional parameter, which defaults to not do this auto update:
  --auto-update-params        Automatically update error rate and kmer coverage model parameters based on the
mapping of the previous sample. It could potentially generate more accurate results in some very specific cases, 
first run should always have this flag off
  1. Adds 3 parameters to control when a gene should be filtered out due to too low or too high coverage. This removes hard-coded values that were previously in pandora. This is basically done by looking at the mean coverage of the gene and comparing with the values given for these 3 params:
  --min-abs-gene-coverage FLOAT
                              Minimum absolute mean gene coverage to keep a gene. Given the coverage on the kmers of the maximum
likelihood path of a gene, we compute the mean gene coverage and compare with the value in this parameter. If the mean is
lower than this parameter, the gene is filtered out, e.g. if this parameter value is 3, then all genes with mean <3 will be filtered
out.
  --min-rel-gene-coverage FLOAT
                              Minimum relative mean gene coverage to keep a gene. This is a proportion, between 0.0 and 1.0. Given the
coverage on the kmers of the maximum likelihood path of a gene, we compute the mean gene coverage and compare with the
value in this parameter and the global coverage. If the mean is lower than the computed value, the gene is filtered out, e.g. if
this parameter value is 0.05, then all genes with mean < 5% of the global coverage will be filtered out.
  --max-rel-gene-coverage FLOAT
                              Maximum relative mean gene coverage to keep a gene. Given the coverage on the kmers of the maximum
likelihood path of a gene, we compute the mean gene coverage and compare with the value in this parameter and the global
coverage. If the mean is higher than the computed value, the gene is filtered out, e.g. if this parameter value is 10, then all
genes with mean >10 times the global coverage will be filtered out.
iqbal-lab commented 1 year ago

I think we probaby need to increase this default value

float max_relative_gene_coverage { 10 };

to 100 or something, as they might be on plasmids.

mbhall88 commented 1 year ago

I see this flag has also been added to map. I don't understand how one would auto-update the error rate for sample $n+1$ for map? Sorry if I am missing something obvious.

Also, my two cents: I feel it might be better to make auto-update the default - i.e. keep the current behaviour - and allow a user to turn it off, thus acknowledging their data isn't clean.

leoisl commented 1 year ago

I think we probaby need to increase this default value

float max_relative_gene_coverage { 10 };

to 100 or something, as they might be on plasmids.

Done in https://github.com/rmcolq/pandora/pull/335/commits/08354f57ebb55fad34490d8905bb12ddd040d5b0

leoisl commented 1 year ago

I see this flag has also been added to map. I don't understand how one would auto-update the error rate for sample n+1 for map? Sorry if I am missing something obvious.

Indeed it is not useful to update the error rate in map for now. In the near future we are planning to make map be able to map several samples instead of one, so it could be useful in this case. However, the kmer coverage model could be updated and is later used when getting the ML path in map: https://github.com/leoisl/pandora/blob/08354f57ebb55fad34490d8905bb12ddd040d5b0/src/map_main.cpp#L365-L368

Actually I just realised in discover we receive several samples, but we don't update the error model nor the kmer coverage model at all (i.e. we don't have a call to estimate_parameters()). Do you think this is some sort of bug? Should we add this to discover?

leoisl commented 1 year ago

Also, my two cents: I feel it might be better to make auto-update the default - i.e. keep the current behaviour - and allow a user to turn it off, thus acknowledging their data isn't clean.

Agreed, done in https://github.com/rmcolq/pandora/pull/335/commits/658a386f1a9ed76800bb69c948d42f7252217acc

mbhall88 commented 1 year ago

Actually I just realised in discover we receive several samples, but we don't update the error model nor the kmer coverage model at all (i.e. we don't have a call to estimate_parameters()). Do you think this is some sort of bug? Should we add this to discover?

Hmmm I don't know. discover has changed a lot since I was last working on it. I think last I was working on it (before the racon change) it wasn't able to take more than one sample so this wouldn't have been a problem no? But if we're going to allow multiple samples, probably better to mirror what map and compare do. Though, we run racon on every PRG anyway right, so not sure it would make a difference? Would only make a difference if we were relying on kmer coverage in order to trigger running racon (which might be wise to do at some stage)

leoisl commented 12 months ago

I think last I was working on it (before the racon change) it wasn't able to take more than one sample so this wouldn't have been a problem no?

No, it wouldn't.

But if we're going to allow multiple samples, probably better to mirror what map and compare do. Though, we run racon on every PRG anyway right, so not sure it would make a difference? Would only make a difference if we were relying on kmer coverage in order to trigger running racon (which might be wise to do at some stage)

Updating the kmer coverage model right now makes no difference on discovery, but updating the error rate does make, as discovery will now not use a single fixed error rate, but an error rate that is auto updated after processing a sample. I chose to do this to indeed keep consistency with the other 2 commands, implemented in https://github.com/rmcolq/pandora/pull/335/commits/e827462bd2d887a06d29483036ac9db1f0f77d36 . The option --dont-auto-update-params was also added to discovery, in case you want to fallback to the previous behaviour.

leoisl commented 12 months ago

Hey @iqbal-lab , @mbhall88 , @Danderson123 I think this is ready to be merged, please take a look and state your corrections or further reviews, if any.

iqbal-lab commented 12 months ago

Happy to merge after the above is discussed.

iqbal-lab commented 12 months ago

ok to merge