mckennalab / FlashFry

FlashFry: The rapid CRISPR target site characterization tool

problem with genome fasta files with chromosome description #25

Closed eranbio closed 2 years ago

eranbio commented 2 years ago

Thanks for this excellent tool! I found that if the reference genome FASTA files include chromosome descriptions in their headers, and off-target positions are requested in the discovery stage, the program fails in the scoring stage. I manually removed the chromosome descriptions and re-indexed, but wondered if there is an easy fix for this in the code. Thx

aaronmck commented 2 years ago

Could you give a little more detail about the FASTA headers you have? This is often a problem with spaces / special characters in the header. The code is somewhat flexible but this is an easy break point.

eranbio commented 2 years ago

Examples of the FASTA headers are below. I haven't run many trials; for me, simply removing the entire description (everything following the first token, e.g. CM017761.1) worked.

>CM017761.1 Acer yangbiense isolate Malutang-1-2009seedling ecotype Malutang chromosome 1, whole genome shotgun sequence
>CM017762.1 Acer yangbiense isolate Malutang-1-2009seedling ecotype Malutang chromosome 2, whole genome shotgun sequence
>CM017763.1 Acer yangbiense isolate Malutang-1-2009seedling ecotype Malutang chromosome 3, whole genome shotgun sequence
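
For anyone who wants to script the same workaround rather than edit headers by hand, here is a minimal Python sketch. It is not part of FlashFry; the function name strip_fasta_descriptions is made up for illustration, and it assumes the contig ID is the first whitespace-delimited token of each header line.

```python
import io


def strip_fasta_descriptions(src, dst):
    """Copy FASTA text from src to dst, keeping only the first token of each header.

    Hypothetical helper, not part of FlashFry. src and dst are text file objects.
    """
    for line in src:
        if line.startswith(">"):
            # Everything after '>' is split on whitespace; only the first
            # token (e.g. CM017761.1) is kept as the contig name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            dst.write(">" + name + "\n")
        else:
            # Sequence lines are copied through unchanged.
            dst.write(line)


if __name__ == "__main__":
    example = io.StringIO(
        ">CM017761.1 Acer yangbiense isolate Malutang-1-2009seedling "
        "ecotype Malutang chromosome 1, whole genome shotgun sequence\n"
        "ACGTACGT\n"
    )
    cleaned = io.StringIO()
    strip_fasta_descriptions(example, cleaned)
    print(cleaned.getvalue(), end="")
```

With real files, the same function can be called on two handles opened with open(path) and open(out_path, "w"); the cleaned file then needs to be re-indexed, as described above.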

aaronmck commented 2 years ago

I'll try out some of these examples, though commas and other special characters in the contig name are probably the issues here. If the first ID is unique (CM017761.1 for example) it might be worth splitting on that for the moment. I'll let you know when there's something in to address this.
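
Splitting on the first ID is only safe if that ID is unique across the file. A rough way to check this before re-indexing is sketched below in Python; the function name duplicate_first_tokens is hypothetical and nothing here is FlashFry code. It assumes the ID is the first whitespace-delimited token after the ">".

```python
from collections import Counter


def duplicate_first_tokens(lines):
    """Return the sorted first header tokens that occur more than once.

    Hypothetical helper, not part of FlashFry. `lines` is any iterable of
    FASTA text lines, such as an open file handle.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # Count only the first token, e.g. CM017761.1.
                counts[fields[0]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    headers = [
        ">CM017761.1 Acer yangbiense chromosome 1\n",
        "ACGT\n",
        ">CM017762.1 Acer yangbiense chromosome 2\n",
        "ACGT\n",
    ]
    # An empty list means every first token is unique.
    print(duplicate_first_tokens(headers))
```

An empty result suggests the headers can be trimmed to the first token without creating name collisions.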

aaronmck commented 2 years ago

I'm going to close this, as it's somewhat hard to protect against all possible contig names, but let me know if there's still a specific problem to fix.