phageParser / phageParser

A project to extract CRISPR information from open genetic data.
MIT License
111 stars 76 forks source link

Script to analyze BLAST output from host genome #62

Closed mbonsma closed 7 years ago

mbonsma commented 8 years ago

This is marked as a good first bug because, although it sounds complicated, it's computationally not as bad as it sounds - I think. There might be some tricky sections, but if a newcomer is attempting this, I'm available for questions anytime! Just comment in this issue.

Motivation

Research has suggested that there may be cases when CRISPR systems are used for something besides immunity to foreign DNA - perhaps they could be regulating the host genome, or they might simply be inactive. A clue that one of these things might be happening is if there are spacers that come from their own host genome. To this end, we need functions to (A) BLAST spacers against the host genome and (B) analyze the BLAST output. The first function is described in issue #60 and the second function is described here.


The function

Input:

The function should do the following:

Output:

The "Source" column contains the "CRISPR" or "non-CRISPR" flag. All other column headings are fields in the object that results from the BLAST-parser module.

lpix commented 8 years ago

I would like to try this out, but I am going to need help. I am going to update here with questions.

mbonsma commented 8 years ago

@Ipix, awesome!

lwgray commented 8 years ago

Hi @lpix I can also help if you run into any technical problems...

mbonsma commented 8 years ago

@lisajguo, I added a file with CRISPR locus start and end positions for each organism.

bhanu-sharma commented 7 years ago

Hey Is this issue still open? I would like to give it a try.

mbonsma commented 7 years ago

Hi @bhanu546, yes it's still open! Let me know if there's anything I can do to help you get started.

bhanu-sharma commented 7 years ago

Use BioPython's built-in BLAST parser to parse the output. This is also done in filterByExpect_all_v2.py, although I've noticed a few mistakes just now so be careful

In reference to this, what does the parse the output mean? Could you please explain the statement?

mbonsma commented 7 years ago

@bhanu546, it's a bit vague, I agree. By "output" I mean the output of BLAST - for example, the XML file linked here. I'll update it in the issue text as well.

Ky5595 commented 7 years ago

Hi @mbonsma Is this issue still open? I would like to solve it.

mbonsma commented 7 years ago

Hi @Ky5595, yes it is!

wgranados commented 7 years ago

Hello, I can look into this.

mbonsma commented 7 years ago

Woo @wgma00, thanks!

wgranados commented 7 years ago

I'm trying to create a sample XML file from #60 but I'm not sure what to pass as the second parameter to anticrisprblast.py . echo data/velvet-distinct-spacers.fasta python3 anticrisprblast.py <genome>, is what I've managed to figure out so far.

mbonsma commented 7 years ago

I think you want blast_spacer_to_host_genome.py instead of anticrisprblast.py, for starters, but it looks like blast_spacer_to_host_genome.py has a single demo file hardcoded and probably won't do anything interesting.

You could ignore #60 for now and use this example file if you want.

mbonsma commented 7 years ago

And I'll make an issue for blast_spacer_to_host_genome.py to fix it up a bit.

wgranados commented 7 years ago

You mention that there should be an optional cut-off value, but I've read through the documentation you posted but I don't see any mention of a "cut-off" value. How is this to be incorporated? Also are there any bounds on this value?. I've started some work here

ayangromano commented 7 years ago

Hello! Just wondering what the status for this issue is. I would love to assist!

wgranados commented 7 years ago

I wrote up some starter code on my master branch, but I won't have time contribute today. So feel free to take over if you'd like.

ayangromano commented 7 years ago

Can you go over the 'tricky part' of the function? I've re-read what you wrote a few times, but am still not sure I understand it too well. I've also looked into extract_CRISPRdb.py, but don't know how this is ran. Maybe if I can see what the output looks like, I'll have a better understanding of how to assign certain hits as 'CRISPR' and 'non-CRISPR'. Thanks!

mbonsma commented 7 years ago

Hi @ayangromano, basically all this has to do is make sure that we're not counting 'hits' that are self-matches. When you blast a fragment of a bacterial genome against the whole genome, there will be one hit that is a self-match, and any other hits are the ones we're interested in.

extract_CRISPRdb.py is almost deprecated since we've moved to using a database, so probably the best thing to do instead would be to check for that spacer in the organismspacerrepeatpair table, get its genomic_start and genomic_end positions, and check if the BLAST match falls inside those positions.

For example, here's what one row of that table looks like in json format:

    "organism_spacer_repeat_pairs": [
        {
            "genomic_end": 1928488,
            "organism": 1923,
            "id": 1,
            "spacer": 1,
            "repeat": 1,
            "genomic_start": 1928414,
            "order": 1
        }

I realize this is a bit vague, so in the next few days I'll update this issue to reflect more recent development. But feel free to start and ping me with questions in the meantime.

mbonsma commented 7 years ago

I'm going to close this and redirect to #223 which is more current.