hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Posterior and resolution fix #147

Closed EricR86 closed 3 years ago

EricR86 commented 3 years ago

This PR should fix the --resolution option so that it works with posterior output. Note that the posterior.code output already worked before this change, since it relies on the bed_write function.

To establish a test and a ground truth for this change, the data testcase had to be changed to a resolution greater than 1 bp; I used 10. None of the other testcases had data that varied enough to expose the problem. To build the reference, the posterior probability list (in probs) was written to a BED file with one row per GMTK frame, each frame covering a 10 bp section. To preserve the previous run-length encoding, this reference was then merged with bedtools using the command groupBy -i new_baseline.0.bed -g 1,4 -c 2,3 -o min,max | awk -v OFS="\t" '{print $1,$3,$4,$2}' > posterior0.0.merged.bed.
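For readers without bedtools at hand, the two steps described above (one row per frame, then collapsing runs) can be sketched in Python. This is only an illustration, not the code in this PR: frames_to_rows and merge_runs are hypothetical names, the example region and values are made up, and the merge assumes, like the groupBy call, that rows sharing a chromosome and value are consecutive in the input.

```python
from itertools import groupby


def frames_to_rows(chrom, region_start, region_end, values, resolution):
    """Yield one (chrom, start, end, value) row per frame.

    Hypothetical helper: each frame covers `resolution` bases, and the
    last frame is clipped to the end of the region.
    """
    for index, value in enumerate(values):
        start = region_start + index * resolution
        end = min(start + resolution, region_end)
        yield (chrom, start, end, value)


def merge_runs(rows):
    """Collapse consecutive rows that share chromosome and value.

    Mirrors the idea of grouping on columns 1 and 4 and taking the
    minimum start and maximum end of each group.
    """
    for (chrom, value), run in groupby(rows, key=lambda row: (row[0], row[3])):
        run = list(run)
        start = min(row[1] for row in run)
        end = max(row[2] for row in run)
        yield (chrom, start, end, value)


# Made-up example: a 35 bp region at 10 bp resolution gives four frames.
rows = list(frames_to_rows("chr1", 0, 35, [0.9, 0.9, 0.1, 0.1], resolution=10))
merged = list(merge_runs(rows))
```

With these inputs the four per-frame rows collapse into two runs, and the final run ends at the region boundary rather than at a multiple of the resolution.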

The output of the changed code was then compared against this baseline. The start and end coordinates were verified against the viterbi output, and I also checked that no probabilities or datapoints were missing.
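A check along these lines could be scripted roughly as follows. This is a sketch under assumptions, not the comparison that was actually run: find_gaps and same_coverage are hypothetical names, rows are (chrom, start, end, value) tuples in file order, and the example intervals are invented.

```python
def find_gaps(rows):
    """Return (chrom, gap_start, gap_end) for neighbouring rows on the
    same chromosome whose coordinates do not meet end-to-start."""
    gaps = []
    for previous, current in zip(rows, rows[1:]):
        if previous[0] == current[0] and previous[2] != current[1]:
            gaps.append((previous[0], previous[2], current[1]))
    return gaps


def same_coverage(candidate, reference):
    """True when both outputs cover the same number of bases and the
    candidate has no gaps between neighbouring rows."""
    def covered(rows):
        return sum(end - start for _, start, end, _ in rows)

    return covered(candidate) == covered(reference) and not find_gaps(candidate)


# Invented example data: `truncated` is missing the 20-30 interval.
reference = [("chr1", 0, 20, 0.9), ("chr1", 20, 35, 0.1)]
complete = [("chr1", 0, 20, 0.9), ("chr1", 20, 35, 0.1)]
truncated = [("chr1", 0, 20, 0.9), ("chr1", 30, 35, 0.1)]

complete_ok = same_coverage(complete, reference)
truncated_ok = same_coverage(truncated, reference)
missing = find_gaps(truncated)
```

This only checks coverage and contiguity; comparing the values themselves against the baseline would still need a row-by-row comparison.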

It may be worth changing the data testcase to 10 bp resolution, since it already runs posterior.

michaelmhoffman commented 3 years ago

I believe we discussed merging without review until 24 Nov. Let me know if you need me to review anyway.

EricR86 commented 3 years ago

This is more for keeping a record of the changes and how and why they were made. It's possible that the long description should have gone into the commit message instead, with the commits put on a separate develop branch (for regression tests).