songlab-cal / tape-neurips2019

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)
https://arxiv.org/abs/1906.08230
MIT License

Fix bug in to_uniprot_bepler function #19

Closed · DaHaiHuha closed this 4 years ago

DaHaiHuha commented 4 years ago

The original one would return a seq full of 0s; you need to generate an index mask before assigning the values.

rmrao commented 4 years ago

So as far as I know the two pieces of code should do the same thing?

The original will return a boolean mask where the sequence is equal to the id, while the other will return an index mask instead. This alteration should be categorically worse since it does the loop in python instead of numpy. Do you have an example of this causing a bug?
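A minimal sketch of the two approaches being compared, assuming seq is a NumPy array of token ids and pfam_encoding is a single integer (the values and the remapping step here are illustrative, not the repo's actual code):

```python
import numpy as np

seq = np.array([3, 7, 3, 5])   # hypothetical encoded sequence
pfam_encoding = 3              # hypothetical token id to remap

# Original approach: a vectorized boolean mask.
bool_mask = seq == pfam_encoding            # array([ True, False,  True, False])
out_a = seq.copy()
out_a[bool_mask] = 0

# Proposed fix: an index mask built with a Python loop.
idx_mask = [i for i, tok in enumerate(seq) if tok == pfam_encoding]
out_b = seq.copy()
out_b[idx_mask] = 0

# When seq really is a NumPy array, the two are equivalent;
# the loop version is just slower.
assert np.array_equal(out_a, out_b)
```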

DaHaiHuha commented 4 years ago

> So as far as I know the two pieces of code should do the same thing?
>
> The original will return a boolean mask where the sequence is equal to the id, while the other will return an index mask instead. This alteration should be categorically worse since it does the loop in python instead of numpy. Do you have an example of this causing a bug?

In my experiment, seq == pfam_encoding returned a single False rather than a mask, which leads to a problem. But maybe this is caused by a different type of data input?
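This behavior is consistent with seq being a plain Python list rather than a NumPy array: comparing a list to an integer yields a single False instead of an elementwise mask. A quick illustration with hypothetical values:

```python
import numpy as np

pfam_encoding = 1

seq_as_list = [3, 1, 4, 1]
print(seq_as_list == pfam_encoding)   # False: list-vs-int comparison, not elementwise

seq_as_array = np.asarray(seq_as_list)
print(seq_as_array == pfam_encoding)  # [False  True False  True]: a proper boolean mask
```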

rmrao commented 4 years ago

I believe that seq should be a numpy array and pfam_encoding should be a single number. The only way the current code would fail while your proposed fix would succeed is if seq is a list? But TensorFlow, as far as I know, should make seq a numpy array when using tf.py_func.

Can you tell me the types of seq and pfam_encoding when you run your code?
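For context, a minimal sketch of the point about tf.py_func, assuming a TF 1.x environment (where tf.py_func is available); the wrapped function and values are illustrative:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

def make_mask(seq):
    # tf.py_func passes the wrapped function plain numpy arrays,
    # so the elementwise comparison yields a boolean mask as intended.
    assert isinstance(seq, np.ndarray)
    return (seq == 1).astype(np.int32)

seq_tensor = tf.constant([3, 1, 4, 1], dtype=tf.int32)
mask = tf.py_func(make_mask, [seq_tensor], tf.int32)

with tf.Session() as sess:
    print(sess.run(mask))  # [0 1 0 1]
```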

rmrao commented 4 years ago

I mean, your code is fine and I can merge it if there is a bug, but I'd like to double-check in case this is a sign of some other issue, since as far as I know the two should behave identically.

DaHaiHuha commented 4 years ago

Thanks for your carefulness and pursuit of perfection! I ran into this problem when using the convert_sequence_vocab function as a standalone function in PyTorch. Thanks for the explanation; I learned something new.