glogsdon1 / sunk-based_assembly


Questions on SUNK and the reference #1

Open xie186 opened 3 years ago

xie186 commented 3 years ago

Thank you for sharing the code. I really appreciate it. I have a few questions:

1) How do you define a SUNK? I assume you use Jellyfish.

2) What does parentRead mean here? https://github.com/glogsdon1/sunk-based_assembly/blob/2f52998725ace7c63c70ab100e5cd49268e80496/run_sharedSUNKs.sh#L8

3) What is the purpose of running RepeatMasker here? https://github.com/glogsdon1/sunk-based_assembly/blob/2f52998725ace7c63c70ab100e5cd49268e80496/run_sharedSUNKs.sh#L46

4) You run a Snakemake pipeline at the end. Could you please share it? If not, could you please briefly describe each of the rules?

5) If I want to cite this repo, is there a paper I should cite?

glogsdon1 commented 3 years ago

Hi Shaojun,

Sorry for the delayed response; I took some time off for Thanksgiving.

To answer your questions:

  1. We define a SUNK as a k-mer that occurs once (+/- 2 SD) per fold of sequencing coverage. For example, if you have 30-fold sequencing coverage, then a SUNK is any k-mer that occurs 30 times +/- 2 SDs in the dataset. You can determine the distribution of k-mer counts using jellyfish histo.

  2. I define a "parentRead" as the read that starts the tiling path. This might be a read that is anchored into unique space on one side of the gap you are trying to traverse.

  3. The RepeatMasker part of the code is not absolutely necessary, but I included it for validation of the assembly at the very end. One of the ways we check that the assembly is correct is by lining up RepeatMasker annotations for each read in the assembly to make sure that the overlapping regions match. This is more helpful in regions with varied repeat content (for example, segmental duplications or unique sequences) rather than at the centromere where the entire region is usually annotated as alpha-satellite.

  4. Thanks for pointing out that the Snakemake file was missing! I have now pushed it to the repo. It can be found in the scripts folder. I should mention that we are planning to update and refine this pipeline in the coming months.

  5. Yes, you can cite our bioRxiv preprint here: https://www.biorxiv.org/content/10.1101/2020.09.08.285395v1. Thanks!
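
To make the definition in answer 1 concrete, here is a minimal Python sketch of how the count range for SUNKs could be derived from `jellyfish histo` output. It is not code from this repository. The function names (`parse_histo`, `weighted_mean_sd`, `sunk_count_range`) and the idea of estimating the mean and SD from a window around the coverage peak are assumptions made for illustration; the thread does not say how the SD is computed. The only thing taken from Jellyfish itself is the two-column histogram format (occurrence count, then number of distinct k-mers with that count).

```python
import math


def parse_histo(text):
    """Parse two-column `jellyfish histo` output into {occurrence_count: n_distinct_kmers}."""
    histo = {}
    for line in text.strip().splitlines():
        occurrences, n_kmers = line.split()
        histo[int(occurrences)] = int(n_kmers)
    return histo


def weighted_mean_sd(histo, lo, hi):
    """Mean and SD of k-mer occurrence counts inside [lo, hi].

    The window [lo, hi] is a hypothetical, user-chosen range around the
    coverage peak, used to exclude error k-mers (low counts) and repeats
    (high counts).
    """
    items = [(c, n) for c, n in histo.items() if lo <= c <= hi]
    total = sum(n for _, n in items)
    mean = sum(c * n for c, n in items) / total
    variance = sum(n * (c - mean) ** 2 for c, n in items) / total
    return mean, math.sqrt(variance)


def sunk_count_range(mean, sd, n_sd=2):
    """Inclusive range of occurrence counts treated as SUNKs: mean +/- n_sd * SD."""
    lo = max(1, math.ceil(mean - n_sd * sd))
    hi = math.floor(mean + n_sd * sd)
    return lo, hi
```

With 30-fold coverage and an assumed SD of 5, `sunk_count_range(30, 5)` gives the inclusive range 20 to 40. K-mers whose counts fall in that range could then be extracted with a dump restricted to those bounds.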