vgteam / sequenceTubeMap

displays multiple genomic sequences in the form of a tube map
MIT License

Add a script that can batch import and explain Lancet data in the November format #428

Closed adamnovak closed 3 months ago

adamnovak commented 3 months ago

This sort of fixes #427, by generating a BED file that explains which variant is supposed to be seen.

I ran it like this:

cd ~/workspace/sequenceTubeMap
mkdir -p webhost
cd webhost

~/workspace/sequenceTubeMap/scripts/prepare_lancet_output.sh ~/Downloads/STM_DataShare_Nov07_2023\ 2/ ./lancet_2023-11-07

cd lancet_2023-11-07

python3 -m http.server

Then I could put http://[::]:8000/index.bed into my local tube map as the BED file and browse the data.
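As a rough illustration of what "a BED that explains the variant" can look like, here is a minimal Python sketch that formats one BED record with a free-text description in the fourth (name) column. This is not the code in `prepare_lancet_output.sh`; the `bed_line` helper and the description text are hypothetical, and the coordinates are copied as-is from the example region without assuming whether the source is 0- or 1-based.

```python
def bed_line(chrom: str, start: int, end: int, description: str) -> str:
    """Format one tab-separated BED record with a free-text name column.

    BED intervals are 0-based and half-open, so `end` is exclusive.
    """
    if start < 0 or end <= start:
        raise ValueError("BED intervals need 0 <= start < end")
    # Tabs and newlines would break the column layout, so collapse all
    # whitespace runs in the description to single spaces.
    clean = " ".join(description.split())
    return f"{chrom}\t{start}\t{end}\t{clean}"


# Hypothetical description; the real script may word this differently.
line = bed_line("chr1", 38506973, 38506974, "Tumor-specific DEL variant")
print(line)
```

Keeping the explanation in the name column means any BED-aware tool can still parse the first three columns even if it ignores the rest.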

But I think the data provided is not exactly what I really want to look at. It looks like the called tumor variants are not actually in the graphs. For example, at chr1:38506973-38506974 I am supposed to see a Tumor-specific CTGGAATCCAGCAGCCCAGACTTCCACATCATAATTTTCTGGGGCAATGGTTTTCAAACTTCACTGTACG -> C DEL variant. But the graph doesn't show a large deletion. Instead, that deleted sequence occurs 1 base into the leftmost node here, and I can see the softclips in the aligned tumor reads at their left ends, where they would read over the deletion edge that isn't present. Here's a screenshot of the softclips, with the tumor reads in red:

[Image: Screenshot 2024-05-03 at 16 56 40]

So I now have a way to generate and host tumor-normal Lancet examples, including examples for larger indels, but the graphs coming from Lancet don't really seem to be the right ones.
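To make the "1 base into the node" observation concrete, the sketch below splits the REF -> ALT deletion quoted above into its retained anchor and its deleted bases. It assumes the variant is written VCF-style, with ALT being a left-anchored prefix of REF; the `describe_deletion` helper is hypothetical and is not part of the PR.

```python
def describe_deletion(ref: str, alt: str) -> dict:
    """Split a left-anchored REF -> ALT deletion into anchor and deleted bases."""
    if len(alt) >= len(ref) or not ref.startswith(alt):
        raise ValueError("not a left-anchored deletion")
    return {
        "anchor": alt,             # bases kept on both alleles
        "deleted": ref[len(alt):], # bases removed on the ALT allele
        "offset": len(alt),        # deleted bases start this far into REF
    }


REF = "CTGGAATCCAGCAGCCCAGACTTCCACATCATAATTTTCTGGGGCAATGGTTTTCAAACTTCACTGTACG"
info = describe_deletion(REF, "C")
print(f"deleted bases start {info['offset']} base(s) in; "
      f"{len(info['deleted'])} bases are removed")
```

Under that assumption, the anchor `C` is the single base preceding the deleted sequence, which matches the deleted sequence starting 1 base into the node. Tumor reads carrying the deletion would need an edge that skips the deleted bases; without it, the aligner can only softclip them.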

adamnovak commented 3 months ago

OK, I talked to Rajeeva and he told me that Lancet makes several graphs per variant. I added code to find the most centered one for each variant, and now I can indeed see this variant. Here's that deletion being taken by just the tumor reads:

[Image: Screenshot 2024-05-07 at 17 24 37]

These are now probably usable examples.
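One plausible way to pick "the most centered" graph for a variant is sketched below in Python. This is an assumption about the approach, not the code added in this PR: `GraphWindow`, `most_centered`, and the example window names and coordinates are all hypothetical. The idea is to keep only the windows that contain the variant position, then choose the one whose midpoint lies closest to it.

```python
from typing import Iterable, NamedTuple


class GraphWindow(NamedTuple):
    name: str
    start: int  # inclusive
    end: int    # exclusive


def most_centered(variant_pos: int, windows: Iterable[GraphWindow]) -> GraphWindow:
    """Return the window containing variant_pos whose midpoint is nearest to it."""
    candidates = [w for w in windows if w.start <= variant_pos < w.end]
    if not candidates:
        raise ValueError("no window contains the variant")
    # Distance from the window midpoint to the variant; smaller is more centered.
    return min(candidates, key=lambda w: abs((w.start + w.end) / 2 - variant_pos))


# Made-up overlapping windows around a made-up variant position.
windows = [
    GraphWindow("graph_a", 100, 700),
    GraphWindow("graph_b", 300, 900),
    GraphWindow("graph_c", 500, 1100),
]
best = most_centered(620, windows)
print(best.name)
```

Choosing the most centered window leaves the most flanking sequence on both sides of the variant, so reads spanning a large indel are less likely to run off the edge of the graph.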