how to calculate the DR_count?

xxxlglm commented 1 month ago

Dear Dr. Sean,

I hope this message finds you well. I am writing to express my admiration for your recent publication in Current Biology titled "High viral abundance and low diversity are associated with increased CRISPR-Cas prevalence across microbial ecosystems." Your work has provided valuable insights into the relationship between viral abundance, microbial diversity, and the prevalence of CRISPR-Cas systems.

After thoroughly reading your article, I found myself intrigued by the methodology used to calculate the abundance of CRISPR-Cas systems. Unfortunately, the specific method of calculation was not explicitly detailed within the paper, which has left me with a few questions.

Could you kindly elaborate on how the CRISPR-Cas abundance was quantified in your study? Understanding the process would greatly assist my own research and potentially enhance my comprehension of CRISPR-Cas dynamics within microbial ecosystems.

I would greatly appreciate any guidance or additional resources you could provide on this matter. Your expertise in the field is highly respected, and any insights would be invaluable.

Thank you very much for considering my request. I look forward to the opportunity to learn from your work.

Warm regards, Dr Ge

s-meaden commented 1 month ago

Dear Dr Ge,

Thanks for your email and interest in the paper. My apologies if some of the details in the methods sections were not clear. In short, we measured the abundance of CRISPR arrays, rather than individual CRISPR-Cas systems. To do this we used the tools developed by my collaborator, CRISPRDetect3.0 (https://github.com/ambarishbiswas/CRISPRDetect_3.0) and metaCRISPRDetect (https://github.com/ambarishbiswas/metaCRISPRDetect_1.0), to identify CRISPR arrays.

We then used 1 million reads from each metagenomic sample, which we refer to as 'subsampled reads' in the paper, and map these to the arrays. From there we tally the number of reads that map to each array on a per sample basis. The sum of these tallies is used to calculate the overall abundance of CRISPR arrays per sample. I've copied the relevant parts of the methods section below.

Hope that helps and happy to talk further. All the best, Sean

Subsampled reads were then screened against this database using metaCRISPRDetect (https://github.com/ambarishbiswas/metaCRISPRDetect_1.0), which supports rapid identification of CRISPR arrays in short reads using user-provided reference repeat database as an extension of CRISPRDetect.34 Arrays with a likelihood score > 3 were added to the existing CRISPR reference database. Subsampled reads were then mapped to the reference database using blastn35 [parameters: -task blastn-short with default parameters]. CRISPR array quantification Abundance tables were generated for all spacers and direct repeat sequences by tallying the number of reads that mapped to a given DR with 100% sequence identity and coverage.

xxxlglm commented 4 weeks ago

Dear Dr. Sean,

I hope this message finds you well. I am truly grateful for your detailed explanation, which has helped me grasp a significant part of the process. However, I have two further queries that I believe could be clarified for a better understanding:

When mapping the reads, were they aligned to the entire CRISPR array, which includes both the spacer and the repeat, or were they specifically aligned to the direct repeat sequences only?

Regarding the count of reads that mapped to the direct repeats, is there a need for further normalization of these counts? For instance, would it be necessary to adjust the counts by dividing them by the sequencing depth or the length of the sequence to account for variations in sequencing coverage?

I am looking forward to your guidance on these points, as they are crucial for the accuracy and comparability of the results in our research.

Thank you once again for your time and expertise.

Warm regards, Dr Ge

s-meaden commented 4 weeks ago

Hi Dr Ge,

Regarding your first question, they were mapped to just the direct repeat and not the spacers. You may want to contact my collaborator Ambarish Biswas, who is the developer of metaCRISPRdetect, to discuss the best way of doing that or alternative approaches.

For your second question, we subsampled the sequences to an even depth before the mapping to account for variation in sequencing depth. While we didn't normalise against sequence length, that could be important depending on the biological question. We used the term 'CRISPR abundance' quite broadly in the paper as we can't distinguish (with this dataset or approach) between longer arrays or a greater number of short arrays. I'd imagine long-read sequencing might resolve some of those questions.

Best, Sean

xxxlglm commented 3 weeks ago

Dear Dr. Sean,

Thank you very much for your responses. I have one final question I'd like to ask you. In the article, it states that you used "subsample reads" to align with repeats using blastn, a method that is not commonly applied to the alignment of sequencing reads. I am curious about how you executed this process. If you could share your approach with me, or provide your code as a reference, I would be very appreciative.

Thank you very much.

Best regards, Dr Ge

s-meaden commented 3 weeks ago

Hi Dr Ge, I double checked this with Ambarish. He made a database of all the DR sequences identified with the tools mentioned in the paper, then screened the reads against this database with with blastn, default parameters:

blastn -task blastn-short -db DB_NAME -query QUERY_SEQUENCES -out OUTPUT_FILE -outfmt 6 ...

Regarding the 'subsample reads', these are the full length sequences. The subsampling just refers to the down-sampling to standard number (1 million per sample). Apologies if the wording was not clear.

Best, Sean

xxxlglm commented 2 weeks ago

Dear Dr. Sean, Thank you very much for your response. Based on your help, I have calculated the abundance of a sample.I utilized the DR database as the query sequence and the reads from the metagenomic sequencing of 'sample1' as the database. According to the command 'blastn -task blastn-short -query repeat.fasta -db sample_read -out blast.out.txt -perc_identity 100 -qcov_hsp_perc 100 -num_threads 10 -outfmt 6', I generated a matrix. Is this the correct format? Was the format like this before you provided it on GitHub as data.csv DR_id. count repeat1 0 repeat10061 4 repeat101 1 repeat10100 0 repeat10101 14 repeat10109 12 repeat10110 0 repeat10112 0 repeat10124 45 repeat10128 1 repeat10183 46

I hope you can give me some advice, this is very important to me. Best, Dr Ge

s-meaden commented 2 weeks ago

The format I provided in data.csv is the tally of the counts following the blast search. I'd suggest contacting Ambarish about exact formats for output tables as I only have the final count tables which contain read ID and repeat cluster ID. Feel free to give some more context about your work.

Thanks, Sean

xxxlglm commented 2 weeks ago

Thank you very much for your help, I have contacted him, thank you。 best， Dr Ge

s-meaden / Meaden_CB_2021

how to calculate the DR_count? #1