medema-group / bigslice

A highly scalable, user-interactive tool for the large scale analysis of Biosynthetic Gene Clusters data
GNU Affero General Public License v3.0

Parsing and Inserting gbks from MIBiG #59

Open nicholascdove opened 1 year ago

nicholascdove commented 1 year ago

Hi Satria,

Thanks for the great package. I'm having difficulty clustering gbks from MIBiG.

I made an input folder, downloaded MIBiG, and placed the gbks in the input folder.

mkdir -p bigslice_input/AgB_gbk/gbks bigslice_input/taxonomy

wget --no-check-certificate https://dl.secondarymetabolites.org/mibig/mibig_gbk_3.1.tar.gz
tar -xf mibig_gbk_3.1.tar.gz 
mv mibig_gbk_3.1/* bigslice_input/AgB_gbk/gbks

I did the same for my own gbks run through AntiSMASH.

mv data/*gbk bigslice_input/AgB_gbk/gbks

I also made a dummy manifest and taxonomy file (I don't use the sqlite db, I end up parsing it and joining taxonomy from a separate database).

echo -e "# Dataset name\tPath to folder\tPath to taxonomy\tDescription" > bigslice_input/datasets.tsv
echo -e "AgB_gbk\tAgB_gbk/\ttaxonomy/AgB_gbk_taxonomy.tsv\tNULL" >> bigslice_input/datasets.tsv

echo -e "# Genome folder\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tOrganism" > bigslice_input/taxonomy/AgB_gbk_taxonomy.tsv
echo -e "gbks/\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown" >> bigslice_input/taxonomy/AgB_gbk_taxonomy.tsv
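For anyone whose shell handles echo -e differently, the same two dummy files can be written from Python. This is only a sketch that reproduces the header and row contents from the echo commands above; the function name write_dummy_inputs and its parameters are made up for illustration and are not part of BiG-SLiCE.

```python
from pathlib import Path

# Column headers copied from the echo commands above.
DATASETS_HEADER = ["# Dataset name", "Path to folder", "Path to taxonomy", "Description"]
TAXONOMY_HEADER = ["# Genome folder", "Kingdom", "Phylum", "Class", "Order",
                   "Family", "Genus", "Species", "Organism"]


def write_tsv(path, rows):
    """Write rows as tab-separated lines, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join("\t".join(row) + "\n" for row in rows))


def write_dummy_inputs(input_dir, dataset="AgB_gbk", genome_folder="gbks/"):
    """Hypothetical helper: create datasets.tsv and a placeholder taxonomy file."""
    input_dir = Path(input_dir)
    taxonomy_rel = f"taxonomy/{dataset}_taxonomy.tsv"
    write_tsv(input_dir / "datasets.tsv",
              [DATASETS_HEADER, [dataset, f"{dataset}/", taxonomy_rel, "NULL"]])
    # One genome folder with every taxonomic rank set to "Unknown".
    write_tsv(input_dir / taxonomy_rel,
              [TAXONOMY_HEADER, [genome_folder] + ["Unknown"] * 8])
```

Calling write_dummy_inputs("bigslice_input") should give the same two files as the echo commands.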

When I run

bigslice -i bigslice_input \
  --complete \
  -t 4 \
  bigslice_centroids_output

during the parsing and inserting step, I get the message "gbks/BGC0000056.gbk is not a recognized antiSMASH clustergbk", and the same message appears for every MIBiG gbk. My own gbks, in the same run, are parsed without problems.

Can you help? I'm wondering if it has to do with the eligible regex definitions on a newer release of MIBiG? I'd try to debug myself, but my programming skills are pretty novice.

Thanks! Nicholas

nicholascdove commented 1 year ago

Hmm, maybe it's not a regex thing? I renamed the MIBiG gbks to try to match my gbks that worked.

My gbks that were parsed, inserted, and clustered looked like this: AIM000021_asm31892_contig20486033.region001.gbk
Original MIBiG gbk: BGC0002286.gbk
Trying to add a region string: BGC0002286.region001.gbk
Trying to break the BGC part of the regex definition so that it uses ^.+\\.region[0-9]+$: ABGC0002286.region001.gbk

Unfortunately, none of these naming "hacks" got BiG-SLiCE to recognize the MIBiG gbks as antiSMASH gbks. Also, all of the files (my gbks and the MIBiG gbks) were in the same folder, so I don't think it's a directory issue.
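A quick way to check the file names against the region pattern quoted above is sketched below. It assumes the pattern is applied to the file name with the .gbk extension removed, which I have not verified in the BiG-SLiCE source; matches_region_pattern is a hypothetical helper.

```python
import re

# Pattern quoted in the comment above (written there with an escaped backslash).
REGION_PATTERN = re.compile(r"^.+\.region[0-9]+$")


def matches_region_pattern(filename):
    """Return True if the name, minus a trailing .gbk, fits the region pattern."""
    # Assumption: the extension is stripped before matching.
    stem = filename[:-len(".gbk")] if filename.endswith(".gbk") else filename
    return bool(REGION_PATTERN.match(stem))


names = [
    "AIM000021_asm31892_contig20486033.region001.gbk",
    "BGC0002286.gbk",
    "BGC0002286.region001.gbk",
    "ABGC0002286.region001.gbk",
]
results = {name: matches_region_pattern(name) for name in names}
```

Under that assumption, both renamed MIBiG files fit the pattern just like the working file does, so the file name alone would not explain the rejection.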

nicholascdove commented 1 year ago

Not actually "closed"; I just hit the wrong button. :)

nicholascdove commented 1 year ago

Looks like my issue has more to do with the parse_gbk() function in bgc.py. On lines 98-170, there is an if/else statement that treats different versions of antiSMASH gbks differently.

Line 98: if antismash_version.split(".")[0] in ["5", "6"]:
Line 170: else: # assume antiSMASH 4

The problem is that current MIBiG gbks do not have an antiSMASH version: their version field reads Version :: False.

So, this if/else treats them like an antiSMASH 4 gbk and searches for a feature of type cluster. Because none is found, they are not recognized as antiSMASH gbks. Lines 170-182:

            else:  # assume antiSMASH 4
                cluster = None
                for feature in gbk.features:
                    if feature.type == "cluster":
                        if cluster:  # contain 2 or more clusters
                            cluster = None
                            break
                        else:
                            cluster = feature
                if not cluster:
                    print(orig_gbk_path +
                          " is not a recognized antiSMASH clustergbk")
                    break

Maybe this is the issue? Please let me know. Thanks!
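The fall-through can be reproduced in isolation. The sketch below copies only the version test quoted from line 98; select_parser_branch and its return labels are invented for illustration, and it assumes the parser sees the version as the string "False".

```python
def select_parser_branch(antismash_version):
    """Mimic the quoted check: major version 5 or 6, otherwise the antiSMASH 4 path."""
    if antismash_version.split(".")[0] in ["5", "6"]:
        return "antismash_5_or_6"
    # Anything else, including the string "False", lands here.
    return "antismash_4_fallback"


branches = {version: select_parser_branch(version)
            for version in ["False", "4.2.0", "5.0.0", "6.1.1"]}
```

With these inputs, "False" is routed to the same branch as a real antiSMASH 4 version, which is where the search for a cluster feature fails.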

nicholascdove commented 1 year ago

Okay, I figured it out. The assumption in my last comment was correct.

For others who are running into a similar issue, my workaround was to change the version in the MIBiG gbk from False to 5.0.0. You can use the following command in a for loop (the -i flag of GNU sed edits the file in place; redirecting the output back onto the input file with > would empty it): sed -i 's/Version :: False/Version :: 5.0.0/' BGC000001.gbk

I'm going to leave the issue open so the bug can be fixed in the package :)

BioGavin commented 8 months ago

Here is the command for batch modification: for i in mibig_gbk_3.1/*.gbk; do sed -i 's/Version :: False/Version :: 5.0.0/' "$i"; done
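If sed -i is not available (its syntax differs between GNU and BSD sed), the same batch edit can be sketched in Python. The function names are hypothetical; the two version strings are taken from the sed command above. Unlike sed without the g flag, this replaces every occurrence in a file, which should make no difference if the version string appears only once.

```python
from pathlib import Path

OLD = b"Version :: False"
NEW = b"Version :: 5.0.0"


def patch_version(gbk_path):
    """Rewrite one gbk in place; return True if the version string was replaced."""
    path = Path(gbk_path)
    data = path.read_bytes()  # bytes avoid any guess about the file encoding
    if OLD not in data:
        return False
    path.write_bytes(data.replace(OLD, NEW))
    return True


def patch_folder(folder):
    """Patch every *.gbk in a folder and return the names of the changed files."""
    return [p.name for p in sorted(Path(folder).glob("*.gbk")) if patch_version(p)]
```

For example, patch_folder("mibig_gbk_3.1") would touch only the files that still carry Version :: False and leave the rest unchanged.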