marcpaga / sturgeon

Sturgeon is a CNS neural network classifier
https://www.nature.com/articles/s41586-023-06615-2
Other
28 stars 8 forks source link

Hg38 support #2

Closed rowanhowellGE closed 1 year ago

rowanhowellGE commented 1 year ago

Hi,

I have data that is aligned to the hg38 reference genome. I can see that sturgeon requires bams aligned to the T2T reference but I was wondering if there is any work around that avoids having to re-align the entire dataset? For example, could UCSC liftover be used to change the coordinates in the bed file created by inputtobed?

Thanks,

marcpaga commented 1 year ago

The way sturgeon works is that it will check for specific coordinates on the bam file to construct the bed file. These coordinates are based on the T2T reference. Changing the inputtobed bed file coordinates will not do anything since the model just needs the methylation status to be fed in the correct order.

There are two solutions here:

  1. Use liftover to change the coordinates to hg38 reference genome of this file: sturgeon/include/static/probes.bed. This bed file contains the coordinates that have to be checked on the bam file.

  2. Use this tool called CrossMap to re-align the bam file (which is essentially liftover for bam files).

Solution 1 seems to be the best here, as we only have to create a bed file for each reference version. I will try to add this during the following days.

marcpaga commented 1 year ago

I have added a new bed file that it is now compatible with hg38. To use it you can pass --reference-genome hg38 as an argument to inputtobed (you will have to update to the new sturgeon version). I have tested this with a few of our samples, and I do not see any differences in performance. Still, the liftover lead to a loss of 226 probes and all our testing is done on T2T.

Could you perhaps re-align a handful of your samples as well to T2T to compare to hg38? And if possible, let me know if you find any strange behaviours?

If this works for you, please close the issue. Thanks

rowanhowellGE commented 1 year ago

Hi,

Thanks for the rapid response and fix. I have tried aligning one of my samples to T2T and Hg38 and using the new probe set I got the correct classification with both alignments and same confidence level (both 0.999). I will continue working with the Hg38 version as that is what I am using for other analyses.

Just before I close, I am wondering slightly about these 226 probes that were missed during liftover. Presumably all 450k probes have a Hg38 coordinate that could be determined without performing liftover on the probe set. Would that resolve the missing probes?

Thanks

marcpaga commented 1 year ago

I have been looking into liftover issues. The original probe coordinates are from the Hg19 reference genome. The coordinates that we had until now came from Hg19->T2T->Hg38 (loss of 226), I have now done the liftover directly from Hg19, which leads to only a loss of 26. You can get this new bed by updating to Sturgeon 0.3.6.

According to the UCSC website, the missing probes coordinates are deleted in the new reference genomes.

lidd77 commented 1 month ago

hey, which align tool should I use to map reads to hg38/T2T ? minimap2 is best ?

marcpaga commented 1 month ago

We always used the minimap2 version that comes bundled with guppy/dorado.