tyjo / coptr

Accurate and robust inference of microbial growth dynamics from metagenomic sequencing
GNU General Public License v3.0

Location for OriC? #1

Closed by cdiener 3 years ago

cdiener commented 3 years ago

Hi,

Is there a way to also return the location for the origin of replication?

tyjo commented 3 years ago

It is possible to return the estimated OriC location for CoPTR-Ref, but it would take some work to translate it to a genomic coordinate. The reason is that CoPTR-Ref models read density along the unit interval [0, 1] after a filtering step. Because of that filtering, converting the OriC estimate from [0, 1] back to a genomic coordinate might require additional bookkeeping.
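To illustrate the kind of bookkeeping involved, here is a minimal sketch, not CoPTR's actual code. It assumes (hypothetically) that filtering drops whole genomic intervals and that the surviving intervals are concatenated and rescaled uniformly onto [0, 1]. The names `unit_to_genomic` and `kept_intervals` are made up for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def unit_to_genomic(x, kept_intervals):
    """Map x in [0, 1] to a genomic coordinate.

    kept_intervals: sorted, non-overlapping, half-open (start, end) intervals
    that survived filtering (an assumption of this sketch).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    lengths = [end - start for start, end in kept_intervals]
    total = sum(lengths)
    if total <= 0:
        raise ValueError("no sequence left after filtering")
    # Offset along the concatenated, filtered sequence (clamped so x == 1 stays in range).
    offset = min(int(x * total), total - 1)
    # Cumulative end offsets of each kept interval in the concatenated sequence.
    cumulative = list(accumulate(lengths))
    idx = bisect_right(cumulative, offset)
    previous = cumulative[idx - 1] if idx > 0 else 0
    return kept_intervals[idx][0] + (offset - previous)


# Hypothetical example: positions 100-199 were filtered out.
kept = [(0, 100), (200, 300)]
midpoint = unit_to_genomic(0.5, kept)
```

In this toy example the midpoint of the unit interval lands at the start of the second kept interval, because the filtered region contributes no length to [0, 1].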

Having said that, I can create a separate branch with an experimental --oriC flag, if this is something you're interested in.

cdiener commented 3 years ago

Yep, we would be interested in that. Would a similar strategy work for contigs? If you know which contig a bin comes from, I would guess the same approach should work there as well...

Starting with the ref-based ones would be great for now though! Thanks!

tyjo commented 3 years ago

The contig method is slightly more complicated. It ranks bins by their estimated distance from OriC: the higher the rank, the closer to OriC. But the ranking can be noisy, so I suspect you may need an additional module to make a prediction from it.

Let me look into this. I'll create a new branch that outputs OriC predictions for the ref-based method, and the ranking for the contig-based method. I should be able to commit new code by the end of this week.

cdiener commented 3 years ago

That sounds great. No hurry though. Thanks a lot for your help!

tyjo commented 3 years ago

I created a new branch called oriC. It adds an --oriC flag to the estimate step that outputs an additional CSV file with two columns. The first column is the reference genome ID. The second column depends on whether the genome is complete or an assembly.

For complete genomes the second column is the estimated OriC position.

For assemblies, the second column is a tab-separated list of the reordered bins from the assembly. The format of each entry is [CONTIG-ID]-[POSITION]. The bins are ordered by decreasing distance to the replication origin, so bins at the end are closer to the origin and bins near the beginning are farther away.
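A rough sketch of how this file could be read in Python follows. It is not part of CoPTR, and it makes several assumptions: the file has no header row, the complete-genome value parses as a number, and [POSITION] is an integer. The function names `parse_bin_label` and `read_oric_csv` are hypothetical.

```python
import csv
import io


def parse_bin_label(label):
    """Split '[CONTIG-ID]-[POSITION]' on the last hyphen (contig IDs may contain hyphens)."""
    contig_id, sep, position = label.rpartition("-")
    if not sep:
        raise ValueError(f"not a bin label: {label!r}")
    return contig_id, int(position)  # assumes POSITION is an integer


def read_oric_csv(handle):
    """Return {genome_id: float position} or {genome_id: [(contig_id, position), ...]}."""
    results = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        genome_id, value = row[0], row[1]
        try:
            # Complete genome: a single estimated OriC position.
            results[genome_id] = float(value)
        except ValueError:
            # Assembly: tab-separated bin labels, farthest from the origin first.
            results[genome_id] = [parse_bin_label(b) for b in value.split("\t") if b]
    return results


# Made-up example content, only to show the two row shapes.
example = io.StringIO("genomeA,0.42\nassemblyB,c1-0\tc1-1\tc2-x-3\n")
parsed = read_oric_csv(example)
```

Because bins at the end of the list are closer to the origin, `parsed["assemblyB"][-1]` is the bin ranked closest to it.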

There are a few caveats for the assembly output. First, the ordering is noisy. Individual bins may be misplaced, but the overall trend should track proximity to the replication origin. Second, small contigs shorter than 10 Kb are excluded from the reordering. Finally, depending on its completeness, the assembly may not contain the replication origin.
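Since individual bins can be misplaced, one way to look at the overall trend is to summarize ranks per contig. The sketch below is only an illustration under assumptions and is not something CoPTR provides. It takes an already parsed list of (contig_id, position) tuples in output order and reports each contig's median rank. The name `contig_median_ranks` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def contig_median_ranks(ordered_bins):
    """Summarize a noisy bin ordering by the median rank of each contig.

    ordered_bins: (contig_id, position) tuples in output order, i.e. farthest
    from the origin first, closest last. A higher median rank suggests the
    contig lies closer to the replication origin.
    """
    ranks = defaultdict(list)
    for rank, (contig_id, _position) in enumerate(ordered_bins):
        ranks[contig_id].append(rank)
    summary = {contig: median(values) for contig, values in ranks.items()}
    # Contigs with the highest median rank (closest to the origin) come first.
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Made-up ordering for two contigs.
ordered = [("c1", 0), ("c2", 0), ("c1", 1), ("c2", 1), ("c2", 2)]
ranking = contig_median_ranks(ordered)
```

The median is used here because it is less sensitive to a few misplaced bins than the mean. Whether this is adequate for real data is an open question.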

Let me know if I can be of further help. I am happy to support this module, and am curious about the results from the bin reordering.

cdiener commented 3 years ago

Thank you so much, that's awesome. I'll run it for some of our data and get back to you!