yangyxt closed 4 years ago
Sorry for forgetting to attach the first part of the error log:
Just found out that this may be due to the fetched sequence containing an N (which can be any base). I'll use a try-except clause to deal with that.
I'm not used to seeing "|" between integers. I tried it in Jupyter and found it gives the same result as "+"? May I have your opinion on whether this amendment makes sense?
Thanks
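On the "|" question: for Python integers, `|` is bitwise OR, which gives the same result as `+` only when the two operands have no set bits in common. The sketch below illustrates this, assuming the index is built by shifting the running value left by two bits before combining it with a 2-bit code per base. The `BASE_CODE` table and both `encode_*` functions are hypothetical illustrations, not the project's actual `utils.seqtoi`.

```python
# Bitwise OR equals addition only when the operands share no set bits.
assert (4 | 3) == 4 + 3   # 0b100 | 0b011: no overlapping bits
assert (6 | 3) != 6 + 3   # 0b110 | 0b011: bit 1 is set in both operands

# Hypothetical 2-bit code per base (illustration only).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_or(seq):
    """Pack a sequence into an integer using shift and bitwise OR."""
    value = 0
    for base in seq:
        # After the shift the two lowest bits are zero, so OR cannot collide.
        value = (value << 2) | BASE_CODE[base]
    return value


def encode_add(seq):
    """Same packing written with multiplication and addition."""
    value = 0
    for base in seq:
        value = value * 4 + BASE_CODE[base]
    return value
```

Under this assumed encoding the two forms agree for every A/C/G/T sequence, and a sequence containing N fails with a `KeyError` because N has no 2-bit code.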
@yangyxt Sorry for the slow response. Yes, it's because of the N. Can you filter out sequences with N? Our indexing works only for A/C/G/T.
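One way to do the suggested filtering is to keep only sequences made up entirely of A/C/G/T before they reach the indexing step. This is a minimal sketch; the helper name `is_indexable` and the sample sequences are made up for illustration. The check is deliberately strict, so lowercase bases are rejected as well (call `.upper()` first if the input may be soft-masked).

```python
VALID_BASES = frozenset("ACGT")


def is_indexable(seq):
    """Return True if seq is non-empty and contains only A, C, G, or T."""
    return bool(seq) and set(seq) <= VALID_BASES


# Made-up example sequences: the second contains N, the third an IUPAC code R.
sequences = ["ACGTACGTACGT", "ACGTNCGTACGT", "ACGTRCGTACGT"]
kept = [s for s in sequences if is_indexable(s)]
skipped = [s for s in sequences if not is_indexable(s)]
```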
Hey @vincentiusmartin,
Sorry for commenting on a closed issue, but I have updated the function chrom_cidx_helper to deal with this:
def chrom_cidx_helper(cidx, cidx_dataset, chromosome_version, kmer):
    print("Iterating dataset for chromosome {}...".format(cidx))
    chromosome = utils.get_chrom(config.CHRDIR + "/" + chromosome_version + "/chr." + str(cidx) + '.fa.gz')
    result = []
    for idx, row in cidx_dataset.iterrows():
        pos = row['pos'] - 1
        if row['mutated_from'] != chromosome[pos]:
            error = "For the input mutation %s>%s at position %s in chromosome %s, the mutated_from nucleotide (%s) does not match the nucleotide in the %s reference genome (%s). Please check the input data and verify that the correct version of the reference human genome was used." % (row['mutated_from'], row['mutated_to'], row['pos'], row['chromosome'], row['mutated_from'], chromosome_version, chromosome[pos])
            # raise Exception(error)
            print(error)
        seq = chromosome[pos-kmer+1:pos+kmer] + row['mutated_to']  # -5,+6
        # Skip windows containing N: the indexing only supports A/C/G/T
        if 'N' in seq:
            print('Skipping: ' + str(row['chromosome']) + ' ' + str(row['pos']) + ' ' + str(row['mutated_from']) + ' ' + str(row['mutated_to']))
            continue
        # for escore, just use 8?
        escore_seq = chromosome[pos-9+1:pos+9] + row['mutated_to']
        result.append([idx, seq, escore_seq, utils.seqtoi(seq), 0, 0, "None"])  # rowidx,seq,escore_seq,val,diff,t,pbmname
    return result
In addition, I have forked the repository and committed this change to my fork.
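For anyone wanting to try the window logic in isolation, here is a small standalone sketch of the same slicing on a toy string. The function name `mutation_window` and the bounds check are additions for illustration and are not part of the function above. The bounds check is there because a negative slice start in Python is interpreted relative to the end of the string, so a mutation closer than `kmer - 1` bases to the start of the chromosome would otherwise yield a wrong or empty window without any error.

```python
def mutation_window(chromosome, pos_1based, mutated_to, kmer=6):
    """Return the (2*kmer - 1)-base window centred on the mutation with the
    mutated base appended, or None if the window is truncated or contains N."""
    pos = pos_1based - 1                    # convert to 0-based index
    start, end = pos - kmer + 1, pos + kmer
    if start < 0 or end > len(chromosome):  # window would run off either end
        return None
    seq = chromosome[start:end] + mutated_to
    if "N" in seq:                          # only A/C/G/T can be indexed
        return None
    return seq


toy_chrom = "ACGTACGTACGTACGT"              # made-up 16-base "chromosome"
ok = mutation_window(toy_chrom, 8, "C")
near_start = mutation_window(toy_chrom, 3, "C")
with_n = mutation_window("ACGTACNTACGTACGT", 8, "C")
```

With `kmer=6` the window covers five bases on each side of the mutated position, which appears to be what the `-5,+6` comment in the function refers to.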