set up to analyze multiple PacBio amplicons

jbloom commented 4 years ago

The ./data/PacBio_amplicons.gb file now contains all the different potential amplicons with appropriate names.

The process_ccs.ipynb reads in this full set of amplicons as potential targets.

The ./data/README.md has been updated to better describe this and the other input data.

Note: the full pipeline is not yet set up to handle multiple amplicons, so it will break somewhere midway through process_ccs.ipynb.

tylernstarr commented 4 years ago

In the notebook process_ccs.ipynb, the schematics of the amplicon constructs appear to have some bugs -- perhaps this is past where you did the troubleshooting?

the sequence annotations in the images start at -149 instead of 1
by eye, the gene lengths don't seem to match the actual gene construct lengths -- e.g. GD-Pangolin is 42 nt longer than HKU3-1 but this deletion is not evident in the mapped image, whereas LYRa11, which should only be 3 nt shorter than GD-Pangolin, appears substantially shorter. I spot-checked sequences in the PacBio_amplicons.gb file and they appear correct there
the last two images are rendered without the amplicon label

jbloom commented 4 years ago

@tylernstarr, thanks for catching these. The redundant README lines are now removed in e42d90d.

As for the site mislabeling in the images, this is actually a bug in dna_features_viewer. The underlying positions are not wrong; only the tick labels are rendered incorrectly. I've submitted a pull request to dna_features_viewer (see here) to fix that, so once that request is merged we can fix the numbering in the images.

I'm pretty sure the lengths are correct; it's just that the labeling is not very clear. The labels are above the images, so GD-Pangolin is actually the second one and HKU3-1 is actually the third one, and once you notice this, GD-Pangolin is longer than HKU3-1 as expected. I agree the titles are not ideally located and the title is missing for the last one, but I think the images are all correct, just badly formatted.
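
If it helps, the lengths can also be checked independently of the images by reading them from the GenBank headers. Below is a minimal sketch, assuming each record starts with a standard `LOCUS <name> <length> bp` line. The helper `locus_lengths` and the example records (`ampliconA`, `ampliconB`, and their lengths) are hypothetical stand-ins, not taken from PacBio_amplicons.gb, and the header length is only as reliable as the file itself.

```python
import re


def locus_lengths(genbank_text):
    """Return {record_name: declared_length} parsed from GenBank LOCUS lines.

    Hypothetical helper: it trusts the length declared in each LOCUS header
    rather than counting bases in the ORIGIN block.
    """
    lengths = {}
    for line in genbank_text.splitlines():
        # Standard header layout: 'LOCUS  <name>  <length> bp ...'
        match = re.match(r"LOCUS\s+(\S+)\s+(\d+)\s+bp\b", line)
        if match:
            lengths[match.group(1)] = int(match.group(2))
    return lengths


# Made-up records for illustration only; names and lengths are assumptions.
example = """\
LOCUS       ampliconA                 120 bp    DNA     linear   UNK
//
LOCUS       ampliconB                  78 bp    DNA     linear   UNK
//
"""

lengths = locus_lengths(example)
# Difference in declared length between the two illustrative records
diff = lengths["ampliconA"] - lengths["ampliconB"]
print(lengths, diff)
```

To run the same check on the real input, the text of ./data/PacBio_amplicons.gb could be passed to `locus_lengths` in place of `example`, and the resulting differences compared against the expected 42 nt and 3 nt gaps.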

If OK with you, I'd suggest we merge this even with the problematic image formatting, and then when the numbering is fixed by my dna_features_viewer pull request, I can work on re-formatting the titles too. But it should not matter for actual analyses.

jbloomlab / SARS-CoV-2-RBD_DMS

set up to analyze multiple PacBio amplicons #5