tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins
MIT License
224 stars 44 forks source link

a bug on `random_positive` in `dataloader.py` #13

Closed 1511878618 closed 1 year ago

1511878618 commented 1 year ago

There is a bug in the `random_positive` function in `dataloader.py`. When an EC number contains only one protein ID, like EC:3.4.22.54, which only has ['Q9TTH8'] in it:

(Note: I use the split10 dataset, so this EC number may be missing some sequences, but I don't think that matters for this bug.) The function will return Q9TTH8_x (where x is a random integer from 0-9) as the positive sequence. However, since this sequence does not exist in the data, the dataloader fails later when it tries to read the sequence.

So I think the problem is here:

def random_positive(id, id_ec, ec_id): 
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:  # this is where the error comes from: when the EC number has only one sequence, the returned mutant ID may not exist in the data
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:  # this could instead sample from a new list that excludes the anchor (the `id` variable here); a single random.choice would then be more efficient
        pos = random.choice(ec_id[pos_ec])
    return pos

Everything in the comments above is my opinion; I will raise a PR soon.
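The two suggestions in the comments above can be sketched as follows. This is a minimal rewrite under the assumption that mutant sequences `id_0` … `id_9` have been pre-generated for singleton EC numbers (the repo's `_x` convention); it is not the actual PR, just an illustration:

```python
import random

def random_positive(id, id_ec, ec_id):
    """Sample a positive sequence that shares an EC number with `id`.

    Sketch of the suggested fix: build a candidate list that excludes
    the anchor up front (replacing the rejection-sampling while loop),
    and fall back to a pre-generated mutant `id_x` only when the EC
    number contains no sequence other than the anchor.
    """
    pos_ec = random.choice(id_ec[id])
    candidates = [p for p in ec_id[pos_ec] if p != id]
    if not candidates:
        # only the anchor has this EC number: use a mutated copy of it
        return id + '_' + str(random.randint(0, 9))
    return random.choice(candidates)
```

A single `random.choice` over the pre-filtered list is guaranteed to terminate, whereas the original loop could in principle redraw the anchor repeatedly.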

tttianhao commented 1 year ago

Ah, I see. Thanks for posting this. `_x` is created when only one sequence has a particular EC number. Since we need to sample positive sequences other than the anchor sequence, we mutate the anchor sequence and use the mutated sequences as positives. I might have overlooked the implementation of this part in this repo. Thanks for pointing it out ^^

1511878618 commented 1 year ago

> Ah I see. Thanks for posting this. _x is created when only one sequence has a particular EC number. Since we need to sample positive sequences other than the anchor sequence, we mutated the anchor sequence and use the mutated sequences as positive sequences. I might overlooked the implementation of this part in this repo. Thanks for pointing out ^^

OK, by the way, how do you turn the mutated sequences into positive sequences? It doesn't seem to be mentioned in the original paper, or maybe I just didn't notice it, lol. Thanks for replying~

canallee commented 1 year ago

@1511878618 Ahh, this was a mistake while cleaning up our development code. We have a script for mutating the sequence for the EC numbers with only one sequence (to enable gradient computation for the contrastive loss). This script should go before the training and after the retrieval of the ESM embeddings. The inference shouldn't be affected because we provided the embeddings for each EC number (70.pt and 100.pt), but you are right about running into errors during training. I will create a PR to fix this, thanks for bringing this up!

canallee commented 1 year ago

@1511878618 PR #15