katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

SRST2 v0.1.5 crashes when using gene_db with only gene names in the FastA headers #34

Closed cglambert closed 8 years ago

cglambert commented 9 years ago

When using a gene_db whose FastA headers have the form ">geneid" or ">geneid additional information", srst2 crashes with an IndexError (see traceback below).

Attempting to read xxx loci from ST database yyy
Read ST database yyy successfully
Traceback (most recent call last):
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1592, in <module>
    main()
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1548, in main
    db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1102, in run_srst2
    db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1164, in process_fasta_db
    unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1275, in map_fileSet_to_db
    unique_gene_symbols, unique_allele_symbols, pileup_file)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 788, in parse_scores
    gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[2] # cluster ID
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 750, in get_allele_name_from_db
    cluster_id = gene_name = allele_name = seqid = allele_parts[0]
IndexError: list index out of range
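The IndexError can be reproduced outside srst2. The following is a minimal sketch, not the actual srst2 code: the function names are hypothetical, and it assumes that the header is split on whitespace, that the first token is removed with pop(0), and that the fallback branch then reads allele_parts[0]. For a header consisting of a single token, the list is empty by the time it is indexed.

```python
def name_with_pop(header):
    """Hypothetical stand-in for the pre-patch logic."""
    allele_parts = header.split()
    allele_detail = allele_parts.pop(0)      # removes the first token from the list
    allele_info = allele_detail.split("__")
    if len(allele_info) > 2:
        return allele_info[0]
    # Fallback for unclustered headers: the list is empty here
    # when the header contained only one token.
    return allele_parts[0]


def name_with_index(header):
    """Hypothetical stand-in for the patched logic."""
    allele_parts = header.split()
    allele_detail = allele_parts[0]          # reads the first token, list unchanged
    allele_info = allele_detail.split("__")
    if len(allele_info) > 2:
        return allele_info[0]
    return allele_parts[0]


if __name__ == "__main__":
    try:
        name_with_pop("geneid")
    except IndexError as err:
        print("pop(0) variant failed:", err)
    print("index variant returned:", name_with_index("geneid"))
```

Reading the first token by index leaves the list intact, so the fallback branch still has a token to read.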

As a temporary workaround, I modified the code with the following patch.

--- ./srst2orig.py 2015-02-10 14:54:08.000000000 +0100
+++ ./srst2.py  2015-02-10 16:23:16.000000000 +0100
@@ -202,7 +202,8 @@
                                                gene_cluster_symbols[gene_cluster] = cluster_symbol
                                else:
                                        # treat as unclustered database, use whole header
-                                       gene_cluster = cluster_symbol = name
+                                       gene_cluster = cluster_symbol = name.split()[0] #debug: name
+                                       gene_cluster_symbols[gene_cluster] = cluster_symbol #debug
                        else:
                                gene_cluster = name.split(delimiter)[0] # accept gene clusters raw for mlst
                                # check if the delimiter makes sense
@@ -738,7 +739,7 @@
        if run_type != "mlst":
                # header format: >[cluster]___[gene]___[allele]___[uniqueID] [info]
                allele_parts = allele.split()
-               allele_detail = allele_parts.pop(0)
+               allele_detail = allele_parts[0] #debug allele_parts.pop(0)
                allele_info = allele_detail.split("__")

                if len(allele_info)>2:

Please note there is also a potential bug in the last line of the patch context: “if len(allele_info)>2:” should be “if len(allele_info)>3:” instead, since the header format in the comment above has four fields.
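To illustrate the field-count check, here is a hedged sketch of a header parser. The function parse_header is hypothetical and is not part of srst2. It assumes the four-field layout cluster__gene__allele__uniqueID followed by optional free text, and it falls back to the first whitespace-separated token when fewer than four fields are present.

```python
def parse_header(header):
    """Return (cluster_id, gene_name, allele_name, seqid) for a FASTA header.

    Hypothetical helper: assumes four "__"-separated fields, optionally
    followed by whitespace and free-text information.
    """
    tokens = header.split()
    if not tokens:
        raise ValueError("empty FASTA header")
    allele_info = tokens[0].split("__")
    if len(allele_info) > 3:
        # All four fields are present, so unpacking is safe.
        cluster_id, gene_name, allele_name, seqid = allele_info[:4]
    else:
        # Unclustered database: use the first token for every field.
        cluster_id = gene_name = allele_name = seqid = tokens[0]
    return cluster_id, gene_name, allele_name, seqid


if __name__ == "__main__":
    print(parse_header("12__aadA__aadA1__345 some info"))
    print(parse_header("geneid additional information"))
    print(parse_header("a__b__c"))  # only three fields: handled by the fallback
```

With a ">2" test, the three-field header in the last call would take the four-field branch, and any code reading a fourth field would fail there.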

Best Regards, Christophe

katholt commented 8 years ago

Thanks, this is going into the v0.1.6 release