markrobinsonuzh / CrispRVariants

22 stars 4 forks source link

Custom Reference Sequence #7

Open rleenay opened 6 years ago

rleenay commented 6 years ago

Hello,

I have a quick data format question for the readstotarget function.

From my understanding, the "target" variable must be in a GRanges format, taken from the "gd" or "gdl" variable shown in your examples. I have a few interesting cases where I would like to generate a custom reference file (say after HDR insertion) and I was curious if it would be possible to remove the GRanges requirement on the "target" variable? That way a custom DNA character string could be used in it's place, allowing for HDR variants to be easily and rapidly mapped.

If this is a bit too much work, I can try to figure out a workaround! Thank you!

HLindsay commented 6 years ago

Hi,

The "target" needs to be a GRanges object in order to find the alignments that have mapped to the correct sequence. However, the mapping reference itself is arbitrary. For example, if you mapped to a fasta file containing a sequence named "hdr_reference", then your target would be something like GRanges("hdr_reference", IRanges(1, nchar(hdr_reference))) (with the locations either being the entire reference sequence or some subset of it). For a standard (bwa) pipeline with an arbitrary reference, make a fasta file of your custom reference sequence then index it and align to it as usual. Depending on how you set up your reference sequence, you might want to use the min.overlap parameter to allow alignments that don't completely cover the reference sequence - otherwise they are filtered out. I'd also suggest using chimeras = "merge" - this is an experimental parameter for linearising ambiguous alignments where possible - and if you used a paired guide set a large value for chimera.to.target - this determines how close a gap in the alignment must be to the cut site.

The reference sequence input to readsToTarget also doesn't actually need to match the reference genome sequence used for mapping, for example if you mapped to a standard genome but want to count nucleotide substitutions with respect to a known variant allele.

Hope this helps! Helen