Key Error when concat datalist into a pd DataFrame

yangyxt commented 4 years ago

Sorry for forgetting to attach the first part of the error log:

yangyxt commented 4 years ago

I just found out that this may be because the fetched sequence contains an N (which can be any base). I'll use a try-except clause to deal with that.

I'm not used to seeing "|" between integers. I tried it in Jupyter and it seems to give the same result as "+". May I have your opinion on whether this amendment makes sense?
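On the "|" question: bitwise OR and addition agree only when the two integers share no set bits, which is the usual situation when a value is shifted left and a small code is placed in the freed low bits. The sketch below illustrates this with a hypothetical 2-bit base encoding. `encode_kmer` and `BASE_BITS` are assumptions made for illustration; they are not taken from QBiC-Pred, whose `utils.seqtoi` is not shown in this thread.

```python
# Hypothetical 2-bit encoding for illustration only; not QBiC-Pred's utils.seqtoi.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(seq):
    """Pack a string of A/C/G/T into an integer, two bits per base."""
    value = 0
    for base in seq:
        # After the shift the two low bits are zero, so "|" and "+" would
        # give the same result at this step.
        # A base outside A/C/G/T (such as N) raises KeyError here.
        value = (value << 2) | BASE_BITS[base]
    return value


# No shared bits (0b100 and 0b011): OR and addition agree.
assert (4 | 3) == (4 + 3) == 7

# Shared bits (0b11 and 0b01): OR and addition differ.
assert (3 | 1) == 3
assert (3 + 1) == 4
```

So swapping "|" for "+" is safe only as long as the operands never overlap in their set bits. In an encoding like this one, a dictionary lookup on an unexpected base such as N is also one plausible way to end up with a KeyError.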

Thanks

vincentiusmartin commented 4 years ago

@yangyxt Sorry for the slow response. Yes, it's because of the N. Can you filter out sequences that contain N? Our indexing works only for A/C/G/T.
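A minimal sketch of such a filter, assuming the sequences are plain Python strings. The helper name `is_indexable` and the sample data are hypothetical and not part of QBiC-Pred.

```python
# Bases the indexing is stated to support in this thread.
VALID_BASES = frozenset("ACGT")


def is_indexable(seq):
    """Return True if seq is non-empty and contains only A/C/G/T.

    Hypothetical helper for illustration; not part of QBiC-Pred.
    """
    return bool(seq) and set(seq) <= VALID_BASES


# Made-up example sequences: one clean, one with N, one lowercase.
sequences = ["ACGTACGTACG", "ACGTNCGTACG", "acgtacgtacg"]
kept = [s for s in sequences if is_indexable(s)]
```

Checking against the allowed set is stricter than testing for `'N'` alone: it also rejects other ambiguity codes and lowercase bases. Whether lowercase input should be upper-cased first rather than dropped depends on how the reference files are stored, which this thread does not say.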

shashwatsahay commented 2 years ago

Hey @vincentiusmartin,

Sorry for commenting on a closed issue, but I have updated the function chrom_cidx_helper to deal with this:

def chrom_cidx_helper(cidx, cidx_dataset, chromosome_version, kmer):
    print("Iterating dataset for chromosome {}...".format(cidx))
    chromosome = utils.get_chrom(config.CHRDIR + "/" + chromosome_version + "/chr." + str(cidx) + '.fa.gz')
    result = []
    for idx,row in cidx_dataset.iterrows():
        pos = row['pos'] - 1
        if row['mutated_from'] != chromosome[pos]:
            error = "For the input mutation %s>%s at position %s in chromosome %s, the mutated_from nucleotide (%s) does not match the nucleotide in the %s reference genome (%s). Please check the input data and verify that the correct version of the reference human genome was used." % (row['mutated_from'], row['mutated_to'], row['pos'], row['chromosome'], row['mutated_from'], chromosome_version, chromosome[pos])
            #raise Exception(error)
            print(error)
        seq = chromosome[pos-kmer+1:pos+kmer] + row['mutated_to'] #-5,+6
        if 'N' in seq:
            # The indexing only supports A/C/G/T, so skip windows containing an ambiguous base
            print('Skipping: {} {} {} {}'.format(row['chromosome'], row['pos'], row['mutated_from'], row['mutated_to']))
            continue
        # for escore, just use 8?
        escore_seq = chromosome[pos-9+1:pos+9] + row['mutated_to']
        result.append([idx,seq,escore_seq,utils.seqtoi(seq),0,0,"None"]) #rowidx,seq,escore_seq,val,diff,t,pbmname
    return result

In addition, I have forked the repository and committed this change to my fork.
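One thing to keep in mind with the function above: the skip check looks only at `seq`, while `escore_seq` is cut from a fixed window (`pos-9+1` to `pos+9`) that is wider whenever `kmer` is smaller than 9. An N can therefore sit in the E-score window without appearing in the k-mer window. Whether that matters downstream is not stated in this thread. The following is a sketch of checking both windows, under that assumption. `extract_windows`, its `escore_k` parameter, and the toy chromosome are hypothetical.

```python
def extract_windows(chromosome, pos, kmer, mutated_to, escore_k=9):
    """Cut the k-mer and E-score windows around a 0-based position.

    Returns (seq, escore_seq), each with mutated_to appended as in the
    function above, or None if a window would start before the sequence
    or contains anything other than A/C/G/T.
    Hypothetical helper for illustration; not part of QBiC-Pred.
    """
    widest = max(kmer, escore_k)
    if pos - widest + 1 < 0:
        # A negative start index would silently slice from the end of the string.
        return None
    seq = chromosome[pos - kmer + 1 : pos + kmer]
    escore_seq = chromosome[pos - escore_k + 1 : pos + escore_k]
    if set(seq + escore_seq) - set("ACGT"):
        return None
    return seq + mutated_to, escore_seq + mutated_to


# Toy chromosome: a single N followed by clean sequence.
toy = "N" + "ACGT" * 6

# At pos=8 with kmer=6 the k-mer window is clean but the E-score window is not.
assert "N" not in toy[8 - 6 + 1 : 8 + 6]
assert "N" in toy[8 - 9 + 1 : 8 + 9]

skipped = extract_windows(toy, 8, 6, "A")
accepted = extract_windows(toy, 12, 6, "A")
```

Here `skipped` is `None` because of the N in the wider window, while `accepted` holds both windows for a position far enough from the N.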

vincentiusmartin / QBiC-Pred

Key Error when concat datalist into a pd DataFrame #14