janka2012 / digIS

Pipeline detecting distant and putative novel insertion sequences in prokaryotic genomes
MIT License

No reports of IS3, IS4 and IS5 in the .sum file #7

Open fgaudilliere opened 2 years ago

fgaudilliere commented 2 years ago

Hello,

I ran digIS on a large number of bacterial genomes. IS elements belonging to the IS3, IS4, IS5, IS200/605 and ISNCY families are reported in the .csv and .gff files, but they are not listed under these families in the .sum file. Is there a reason for this?

Best, Flora

janka2012 commented 2 years ago

Hi @fgaudilliere, thanks for reporting this. In general, I do not see a reason why this should happen. Could you share with me at least one bacterial genome in which you see this happening, together with the corresponding .csv/.gff and .sum files?

fgaudilliere commented 2 years ago

Here are the files for one of my genomes: digIS_issue.zip

janka2012 commented 2 years ago

@fgaudilliere I found the issue. IS3, IS4, IS5 and IS200/605 contain multiple subfamilies (see here), and we refer to them as, e.g., IS3_IS2. If the name of a found record contains a _ character, it is reported in the "other" group of detected IS elements, because it is at the subfamily level rather than the family level. I can fix this, but it would be nice to hear your perspective on what would make the most sense. In the meantime, if this is blocking you in any way, feel free to create your own summary statistics from the .csv/.gff output files. I hope this helps :)

fgaudilliere commented 2 years ago

Thanks for the quick answer! I think what would make the most sense would be to group the IS3, IS4, IS5 and IS200/605 copies by family in the .sum report. That way it remains an overview without too many details, and anyone interested in the subfamily can still find that information in the .csv and .gff files. And yes, I'm writing a small script to extract data from the .csv file :)
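
For anyone who lands here with the same problem, a per-family count along these lines might look like the sketch below. It is only an illustration, not the digIS implementation: the column name `family` and the sample rows are assumptions (check the header of your own .csv), and it assumes that everything before the first `_` in a label such as IS3_IS2 is the family name. The thread does not show how IS200/605 is spelled in the output, so that family may need special handling if its own name contains an underscore.

```python
import csv
import io
from collections import Counter


def family_of(label: str) -> str:
    """Return the family part of a label, e.g. 'IS3_IS2' -> 'IS3'.

    Assumption: the family name is everything before the first underscore.
    Labels without an underscore are returned unchanged.
    """
    return label.split("_", 1)[0]


def count_by_family(rows, column="family"):
    """Count records per family.

    `rows` is an iterable of dicts (e.g. a csv.DictReader).
    `column` is a hypothetical column name; adjust it to the real header.
    """
    return Counter(family_of(row[column]) for row in rows)


# Made-up sample data standing in for a digIS .csv file.
sample = io.StringIO(
    "id,family\n"
    "hit1,IS3_IS2\n"
    "hit2,IS3_IS150\n"
    "hit3,IS5_IS903\n"
    "hit4,IS1\n"
)

counts = count_by_family(csv.DictReader(sample))
for family, n in sorted(counts.items()):
    print(family, n)
```

To run it on a real file, replace the `io.StringIO` sample with `open("your_output.csv", newline="")` and pass the matching column name. If the file uses a delimiter other than a comma, pass `delimiter=` to `csv.DictReader`.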