Closed mbonsma closed 7 years ago
I would like to try this out, but I am going to need help. I am going to update here with questions.
@Ipix, awesome!
Hi @lpix I can also help if you run into any technical problems...
@lisajguo, I added a file with CRISPR locus start and end positions for each organism.
Hey Is this issue still open? I would like to give it a try.
Hi @bhanu546, yes it's still open! Let me know if there's anything I can do to help you get started.
Use BioPython's built-in BLAST parser to parse the output. This is also done in filterByExpect_all_v2.py, although I've noticed a few mistakes just now so be careful
In reference to this, what does the parse the output mean? Could you please explain the statement?
@bhanu546, it's a bit vague, I agree. By "output" I mean the output of BLAST - for example, the XML file linked here. I'll update it in the issue text as well.
Hi @mbonsma Is this issue still open? I would like to solve it.
Hi @Ky5595, yes it is!
Hello, I can look into this.
Woo @wgma00, thanks!
I'm trying to create a sample XML file from #60 but I'm not sure what to pass as the second parameter to anticrisprblast.py
. echo data/velvet-distinct-spacers.fasta python3 anticrisprblast.py <genome>
, is what I've managed to figure out so far.
I think you want blast_spacer_to_host_genome.py
instead of anticrisprblast.py
, for starters, but it looks like blast_spacer_to_host_genome.py
has a single demo file hardcoded and probably won't do anything interesting.
You could ignore #60 for now and use this example file if you want.
And I'll make an issue for blast_spacer_to_host_genome.py
to fix it up a bit.
You mention that there should be an optional cut-off value, but I've read through the documentation you posted but I don't see any mention of a "cut-off" value. How is this to be incorporated? Also are there any bounds on this value?. I've started some work here
Hello! Just wondering what the status for this issue is. I would love to assist!
I wrote up some starter code on my master branch, but I won't have time contribute today. So feel free to take over if you'd like.
Can you go over the 'tricky part' of the function? I've re-read what you wrote a few times, but am still not sure I understand it too well. I've also looked into extract_CRISPRdb.py, but don't know how this is ran. Maybe if I can see what the output looks like, I'll have a better understanding of how to assign certain hits as 'CRISPR' and 'non-CRISPR'. Thanks!
Hi @ayangromano, basically all this has to do is make sure that we're not counting 'hits' that are self-matches. When you blast a fragment of a bacterial genome against the whole genome, there will be one hit that is a self-match, and any other hits are the ones we're interested in.
extract_CRISPRdb.py is almost deprecated since we've moved to using a database, so probably the best thing to do instead would be to check for that spacer in the organismspacerrepeatpair
table, get its genomic_start
and genomic_end
positions, and check if the BLAST match falls inside those positions.
For example, here's what one row of that table looks like in json format:
"organism_spacer_repeat_pairs": [
{
"genomic_end": 1928488,
"organism": 1923,
"id": 1,
"spacer": 1,
"repeat": 1,
"genomic_start": 1928414,
"order": 1
}
I realize this is a bit vague, so in the next few days I'll update this issue to reflect more recent development. But feel free to start and ping me with questions in the meantime.
I'm going to close this and redirect to #223 which is more current.
This is marked as a good first bug because, although it sounds complicated, it's computationally not as bad as it sounds - I think. There might be some tricky sections, but if a newcomer is attempting this, I'm available for questions anytime! Just comment in this issue.
Motivation
Research has suggested that there may be cases when CRISPR systems are used for something besides immunity to foreign DNA - perhaps they could be regulating the host genome, or they might simply be inactive. A clue that one of these things might be happening is if there are spacers that come from their own host genome. To this end, we need functions to (A) BLAST spacers against the host genome and (B) analyze the BLAST output. The first function is described in issue #60 and the second function is described here.
The function
Input:
Accession, CRISPR_ID, start, end
.The function should do the following:
Output:
accession_self-spacers.csv
(where "accession" is the NCBI accession number, i.e. NC_000853) with the following column headings:The "Source" column contains the "CRISPR" or "non-CRISPR" flag. All other column headings are fields in the object that results from the BLAST-parser module.