pinellolab / CRISPRme

A few questions... #57

Closed tcrevier closed 4 months ago

tcrevier commented 6 months ago

Hi all,

Thanks for the tool! I have a few questions:

1) Are the _alt and _random files from the hg38 genome used as part of the search with the standard VCF sets? If so, can you explain how they are identified in the output?

2) I am interested in using some other VCF datasets. Can you provide any information on what the tool supports for VCF files? I plan to format them the same way your script does when cleaning up the gnomAD 3.1 dataset, but I would prefer not to convert them to multi-allelic records as you do, because of the information loss. Is that conversion required?

Thanks,

Tom

ManuelTgn commented 6 months ago

Hi @tcrevier,

Thank you for your questions. Here are my responses:

1) Yes, when performing a CRISPRme search, all FASTA files within the directory specified through the --genome flag are considered. Therefore, if the selected directory contains _alt and _random chromosome data, those sequences will be included in the search. However, they won't be enriched with variants unless a VCF containing variants mapped to those chromosomes exists in the VCFs directory. If such VCF files are available, their file names must contain the corresponding chromosome name (e.g., chrUn_KI270588v1.vcf.gz), as for canonical chromosomes. If they are not available, CRISPRme will search for targets on the reference sequence only, similar to running CRISPRme in "reference-only" mode. Targets found on these chromosomes will appear in the final report, with the chromosome column showing the non-canonical chromosome name (e.g., chrX_KI270880v1_alt instead of chrX).
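To illustrate the naming rule above, here is a minimal sketch that pairs FASTA files with VCFs by chromosome name. It is not taken from the CRISPRme code base: the helpers `chromosome_name` and `match_vcfs_to_chromosomes`, the example file names, and the assumption that the chromosome name appears as a dot-separated token of the VCF file name are all hypothetical.

```python
from pathlib import PurePath


def chromosome_name(fasta_name):
    """Strip directory and FASTA extension, e.g. 'chr1.fa' -> 'chr1'."""
    name = PurePath(fasta_name).name
    for ext in (".fasta", ".fa"):
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def match_vcfs_to_chromosomes(fasta_files, vcf_files):
    """Map each chromosome to the VCFs naming it as a dot-separated token.

    Token matching (an assumption of this sketch) keeps 'chr1' from
    matching 'chr10' or 'chr1_KI270706v1_random'.
    """
    matches = {}
    for fasta in fasta_files:
        chrom = chromosome_name(fasta)
        matches[chrom] = [
            vcf for vcf in vcf_files if chrom in PurePath(vcf).name.split(".")
        ]
    return matches


# Hypothetical directory contents.
fastas = ["chr1.fa", "chr10.fa", "chrUn_KI270588v1.fa"]
vcfs = ["ALL.chr1.example.vcf.gz", "chrUn_KI270588v1.vcf.gz"]

matches = match_vcfs_to_chromosomes(fastas, vcfs)
for chrom, found in matches.items():
    mode = "variant-enriched" if found else "reference-only"
    print(chrom, mode, found)
```

A chromosome with no matching VCF (chr10 here) corresponds to the reference-only case described above.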

2) CRISPRme does not impose many constraints on the VCFs that can be used during the search. However, there are some requirements. The input VCFs must be in VCF format >= v4.2 and must be phased, listing the samples of origin of each genotype. This allows CRISPRme to perform a haplotype-aware search without introducing recombinants not observed in the input dataset. Regarding the conversion of gnomAD VCFs, you may want to split each multi-allelic site into single alleles. The resulting VCFs will still be supported by CRISPRme as long as they are phased.
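A quick way to check the phasing requirement before running a search is to scan the genotype (GT) fields for the unphased separator `/`. The sketch below is only an illustration and is not part of CRISPRme; the function name `unphased_genotypes` and the example records are made up, and it reads already-decompressed, tab-separated VCF lines.

```python
def unphased_genotypes(vcf_lines):
    """Return (chrom, pos, sample, genotype) for every call that uses '/'."""
    samples = []
    problems = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # record carries no genotype information
        gt_index = format_keys.index("GT")
        for sample, value in zip(samples, fields[9:]):
            genotype = value.split(":")[gt_index]
            if "/" in genotype:  # '|' marks phased, '/' marks unphased
                problems.append((fields[0], int(fields[1]), sample, genotype))
    return problems


# Made-up records: the second one has an unphased call for S01.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS01\tS02",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0",
    "chr1\t200\t.\tT\tG\t.\tPASS\t.\tGT\t1/1\t0|1",
]
print(unphased_genotypes(example))
```

An empty result means every genotype in the scanned lines uses the phased separator.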

I hope this clarifies things for you. If you have any further questions or need more information about CRISPRme, please don't hesitate to ask.

Thanks for using CRISPRme in your research.

Best, Manuel

tcrevier commented 6 months ago

@ManuelTgn

Hi Manuel,

Thanks for the response and clarifications...

  1. That makes total sense and is what I was expecting the answer to be.

  2. Can you clarify a little more? In your script to clean up the gnomAD3.1 files, you are converting them to multi-allelic (which breaks the haplotype-aware search, BTW). The gnomAD files come in bi-allelic format. I want to make sure that conversion to multi-allelic format is not required by CRISPRme because of the impact it has on the phase information. Can you confirm that it is not required?

thanks again,

Tom

khl0798 commented 6 months ago

Hi Manuel, Tom,

Sorry to disturb everyone, but may I ask a related question in this context? I'm also curious to know what the values in the columns corresponding to each sample in a VCF file represent. I often see content such as 0|0, 1|0, and I'd really like to understand what these mean. And how do these column values affect the output results?

If I could receive responses from all of you, I would be truly grateful. Looking forward to your reply.

Kong

tcrevier commented 6 months ago

@khl0798

My understanding is as follows.

These are the genotypes for each of the sample IDs.

For the VCF sets relevant here, you should only see 0/0 (the variant is not present in the population the sample represents) or 0/1 (the variant is present in the population). If you are using VCFs in multi-allelic format (multiple alleles in a single row), the meaning is less well defined: the value is supposed to be 0/n or 0/0, where 0/n indicates that the sample has the nth allele. It is not clearly defined what happens when the population has more than one allele present, which occurs in the larger variant sets.

This information is critical to one of the coolest features of CRISPRme: it is aware of which population(s) a variant is present in and does a haplotype-aware search. An off-target that requires more than one variant is considered only if all of those variants can be present together in a population, rather than blindly combining all of the variants.

@ManuelTgn may have more to add, that is just what I know...

Tom

ManuelTgn commented 6 months ago

Hi @khl0798 and @tcrevier,

As Tom accurately pointed out, values like 0/0, 0/1, and 1/1 refer to the genotype status assigned to each sample during variant calling from WGS data.

CRISPRme takes into account the genotype of each sample to perform a haplotype-aware search. This approach ensures that CRISPRme doesn't generate combinations of variants that aren't observed in the actual population. Let me illustrate this with an example:

Imagine we have two samples, S01 and S02, with the following variants:

REF  ALT  S01  S02
A    G    0/1  0/0
T    G    1/1  0/1
C    T    0/0  1/0

Suppose the first two variants create a new NGG PAM (preceded by a suitable target sequence for our input sgRNA). In this case, the target sequence will be associated only with sample S01 because sample S02 lacks the first variant.
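The sample assignment in this example can be sketched in a few lines of Python. This is only an illustration of the idea at the sample level, not CRISPRme's actual implementation; the names `carries_alt` and `samples_with_all` are hypothetical, and the genotypes are copied from the table above.

```python
import re

# Genotypes per sample, in the row order of the example table.
GENOTYPES = {
    "S01": ["0/1", "1/1", "0/0"],
    "S02": ["0/0", "0/1", "1/0"],
}


def carries_alt(genotype):
    """True if at least one allele in the genotype is non-reference."""
    alleles = re.split(r"[/|]", genotype)
    return any(allele not in ("0", ".") for allele in alleles)


def samples_with_all(genotypes, variant_indexes):
    """Samples that carry every variant in variant_indexes (0-based rows)."""
    return [
        sample
        for sample, calls in genotypes.items()
        if all(carries_alt(calls[i]) for i in variant_indexes)
    ]


# The candidate target needs the first two variants together.
print(samples_with_all(GENOTYPES, [0, 1]))
```

Only S01 is returned for the first two variants, matching the explanation above. With phased genotypes, the same check could be made per haplotype rather than per sample.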

For a multi-allelic site, it would look like this:

REF  ALT  S01  S02
A    G,T  0/1  0/2

This indicates that S01 has the variant allele G, while S02 has the T allele. CRISPRme handles such situations similarly to the former case.
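For readers unfamiliar with the index notation, the following sketch decodes a genotype into the alleles it refers to: index 0 is REF, and index n is the nth entry of the comma-separated ALT field. The helper `decode_genotype` is hypothetical and only illustrates the VCF convention; it is not CRISPRme code.

```python
import re


def decode_genotype(ref, alt, genotype):
    """Translate allele indexes (e.g. '0/2') into allele strings."""
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1..n = ALT entries
    return [
        None if index == "." else alleles[int(index)]  # '.' = missing call
        for index in re.split(r"[/|]", genotype)
    ]


s01 = decode_genotype("A", "G,T", "0/1")
s02 = decode_genotype("A", "G,T", "0/2")
print("S01:", s01)
print("S02:", s02)
```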

I hope this explanation clarifies how CRISPRme handles these cases.

Best, Manuel