harry-thorpe / piggy

Pipeline for analysing intergenic regions in bacteria
GNU General Public License v3.0

failed to produce IGR presence absence matrix #22

Closed nwheeler443 closed 4 years ago

nwheeler443 commented 5 years ago

Hi, I'm trying this out on a small set of 13 genomes. The analysis fails at the matrix production step:

    IGR cluster files created.
    Doing all-vs-all IGR cluster blast search...
    all-vs-all IGR cluster blast search completed
    Merging IGR clusters...
    IGR clusters merged
    9235 IGR clusters found after merging
    Producing IGR presence absence matrix...
    Input file doesn't exist: /home/ubuntu/noncoding/piggy_out/cluster_intergenic_files/Cluster_129.fasta
    failed to produce IGR presence absence matrix

I have tried this on a few different machines, and it fails on the same file every time. Do you have any idea why?

Thanks in advance for your help!

harry-thorpe commented 5 years ago

Hi,

I haven't seen that before. Are you able to share your data? If so I could have a look into it for you.

Alternatively, have you tried running on the four genomes in example_data?

nwheeler443 commented 5 years ago

The example data worked, which suggests it's an issue with my input files. Could it be that my genomes are too divergent? They're different serovars of Salmonella.

I've uploaded the input I'm using to this folder: https://www.dropbox.com/sh/aajz5s4oum5j7ti/AABfqVhiz9F4TKl_7__wncVra?dl=0

harry-thorpe commented 5 years ago

Hi,

I have had a look into this, and yes, it is due to the format of the gff files. I have attached a fragment of one of the gffs, and the resulting sequence tag.

Piggy first searches for CDS tags in the gff, and then goes forwards and backwards to the next CDS to determine the IGRs. When it's doing this it assumes the CDSs all have an ID, and some of them don't. For example, if you go backwards from AY509003.12 to the next CDS you get to the line:

    AY509003 EMBL CDS 1 552 . - 0 Parent=AY509003.5

This CDS doesn't have an ID, and piggy uses the IDs to name the IGRs. In these cases the IGRs can't be named properly and so their files are not created, hence the error.

I think the easiest way to sort this out is to re-annotate the gffs with Prokka. This shouldn't take long and should give annotation which will work with piggy.

Attachment: out_fragment.txt
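To check whether a gff is affected before running piggy, something like the following could list the CDS lines that carry no ID attribute. This is only a rough sketch and not part of piggy: the function name `cds_without_id` is made up for illustration, it assumes standard tab-separated GFF3 with attributes in the ninth column, and the second CDS line in the sample is invented for contrast.

```python
import io


def cds_without_id(gff_lines):
    """Yield (line_number, line) for CDS features that have no ID attribute."""
    for number, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) < 9 or columns[2] != "CDS":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        keys = {
            field.split("=", 1)[0].strip()
            for field in columns[8].split(";")
            if field.strip()
        }
        if "ID" not in keys:
            yield number, line


# The first CDS is the line quoted above (assumed tab-separated in the real file);
# the second CDS is a hypothetical well-formed entry.
sample = "\n".join([
    "##gff-version 3",
    "AY509003\tEMBL\tCDS\t1\t552\t.\t-\t0\tParent=AY509003.5",
    "AY509003\tEMBL\tCDS\t600\t900\t.\t+\t0\tID=example_cds_1",
])

problems = list(cds_without_id(io.StringIO(sample)))
for number, line in problems:
    print(f"line {number}: CDS without ID: {line}")
```

For a real file, pass an open file handle instead of the `io.StringIO` sample. Any line it reports is a CDS that piggy would not be able to use for naming an IGR.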

nwheeler443 commented 5 years ago

Thanks! I found that Roary fixed this issue as part of its workflow, so I could just use the gffs in the "fixed_input_files" folder from the Roary output to resolve the issue. Thanks for your quick help :)

harry-thorpe commented 5 years ago

Interesting - piggy should actually use these automatically if they are available.
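If they are not picked up automatically, the files can be selected by hand. The sketch below is an illustration only, not piggy's own logic: `prefer_fixed_gffs` is a hypothetical helper, and it assumes the files in Roary's "fixed_input_files" folder keep the same file names as the original gffs.

```python
import tempfile
from pathlib import Path


def prefer_fixed_gffs(input_gffs, roary_dir):
    """Return, for each input gff, Roary's fixed copy if one exists, else the original."""
    fixed_dir = Path(roary_dir) / "fixed_input_files"
    chosen = []
    for gff in map(Path, input_gffs):
        # Assumption: the fixed copy has the same file name as the original.
        candidate = fixed_dir / gff.name
        chosen.append(candidate if candidate.is_file() else gff)
    return chosen


# Small demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    roary_out = tmp / "roary_out"
    (roary_out / "fixed_input_files").mkdir(parents=True)
    (roary_out / "fixed_input_files" / "a.gff").write_text("##gff-version 3\n")

    originals = [tmp / "a.gff", tmp / "b.gff"]
    chosen = prefer_fixed_gffs(originals, roary_out)
    chosen_names = [path.parent.name for path in chosen]
    for path in chosen:
        print(path)
```

Here `a.gff` resolves to the fixed copy, while `b.gff` has no fixed copy and falls back to the original path.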