mkusner / grammarVAE

Code for the "Grammar Variational Autoencoder" https://arxiv.org/abs/1703.01925

What does each code do #30

Closed nzarnaghi closed 3 years ago

nzarnaghi commented 3 years ago

Hi,

I want to apply this code to RNA data with a different grammar. Could you please tell me what molecule_vae.py does? What is the difference between make_zinc_dataset_grammar.py and make_zinc_dataset_str.py? If I want to change the grammar and the input data, which parts of the code need to be changed?

Thank you

mkusner commented 3 years ago

molecule_vae.py is a set of helper functions for building the grammar, with functions to encode sequences into grammar rules and back. make_zinc_dataset_grammar.py makes a one-hot dataset of grammar rules from sequence data, whereas make_zinc_dataset_str.py makes a one-hot dataset from the characters of the sequence.
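To illustrate the difference between the two kinds of dataset, here is a minimal sketch in plain Python. It is not the repository's code: the four-letter alphabet, the toy right-recursive grammar in `RULES`, and the helpers `one_hot` and `derivation` are all assumptions made for this example. The character-level encoding has one column per symbol, while the rule-level encoding has one column per grammar rule and one row per rule applied in the derivation.

```python
# Hypothetical toy setup, not taken from the grammarVAE repository.
ALPHABET = ["A", "C", "G", "U"]

# Toy grammar: rule i+1 emits ALPHABET[i]; the last rule ends the sequence.
RULES = [
    "seq -> rna",
    "rna -> 'A' rna",
    "rna -> 'C' rna",
    "rna -> 'G' rna",
    "rna -> 'U' rna",
    "rna -> ",
]


def one_hot(indices, size):
    """Return one row per index, with a single 1 in the indexed column."""
    return [[1 if col == idx else 0 for col in range(size)] for idx in indices]


def derivation(seq):
    """Rule indices used to derive `seq` under the toy grammar above."""
    return [0] + [1 + ALPHABET.index(ch) for ch in seq] + [len(RULES) - 1]


seq = "GAU"

# Character-level dataset row: one one-hot vector per character.
char_rows = one_hot([ALPHABET.index(ch) for ch in seq], len(ALPHABET))

# Rule-level dataset row: one one-hot vector per applied grammar rule.
rule_rows = one_hot(derivation(seq), len(RULES))

print(char_rows)
print(derivation(seq))
print(rule_rows)
```

For a real grammar the derivation comes from a parser rather than from a hand-written helper like `derivation`, which only works here because the toy grammar emits exactly one character per rule.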

If you want to use a new grammar and input, I'd recommend creating a copy of molecule_vae and of every file that has the word 'zinc' in its name. The most important thing will be creating new grammar rules. To respond to your email: the grammar needs to decode a sequence unambiguously. For instance, suppose you have these grammar rules:

1. seq -> rna
2. rna -> A rna
3. rna -> C rna
4. rna -> AC rna
5. rna -> (empty)

If you are trying to decode the sequence 'AC', it is unclear whether the parser should use rules 1, 2, 3, 5 or rules 1, 4, 5. You have to make sure your rules do not overlap like this. See the equation and molecule grammars for inspiration.
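The overlap can be checked mechanically. The sketch below is an illustration written for this thread, not part of the repository: because every recursive rule in the toy grammar emits a fixed chunk and then recurses, each derivation corresponds to exactly one way of cutting the sequence into chunks, so counting segmentations counts derivations. The function name `count_derivations` and its `chunks` parameter are assumptions of this example.

```python
from functools import lru_cache


def count_derivations(seq, chunks=("A", "C", "AC")):
    """Count derivations of `seq` under rules of the form rna -> CHUNK rna | (empty).

    Each derivation corresponds to one way of splitting `seq` into chunks,
    so a count above 1 means the grammar is ambiguous for this sequence.
    """

    @lru_cache(maxsize=None)
    def count_from(pos):
        if pos == len(seq):
            return 1  # the empty rule closes the derivation
        return sum(
            count_from(pos + len(chunk))
            for chunk in chunks
            if seq.startswith(chunk, pos)
        )

    return count_from(0)


# With the overlapping rule 'rna -> AC rna' there are two derivations of 'AC'.
print(count_derivations("AC"))                     # 2
# Without it, the derivation is unique.
print(count_derivations("AC", chunks=("A", "C")))  # 1
```

This shortcut only applies to grammars of this simple right-recursive shape. For a general context-free grammar, one would instead ask a parser for all parse trees and count them.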

nzarnaghi commented 3 years ago

Thank you so much for your help. The RNA sequence consists of the characters 'A', 'C', 'U', 'G'. I have written a grammar as below:

gram = """S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
Nothing -> Nones
"""

But it does not work. Actually, it only parses single characters and not the whole sequence. Could you please guide me how to fix it?

mkusner commented 3 years ago

The grammar won't parse because L and F have the same first rule. Which rule is the parser supposed to choose?

nzarnaghi commented 3 years ago

The grammar is provided below: [image]

Do you think that the code that I have written is wrong? The grammar is proposed in a paper.

mkusner commented 3 years ago

Hmm, my guess is that the grammar would throw errors in the nltk library. Does it?

Even if it doesn't throw errors, this sounds like a simple coding error: you need to pass the entire sequence to nltk. See encode() in molecule_vae.py for how this works in the molecule example. There are also nltk tutorials that may help.
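As a rough sketch of what passing the entire sequence looks like, the snippet below builds an nltk grammar from the RNA rules posted earlier in this thread (leaving out the final `Nothing` rule), tokenizes a whole sequence into single nucleotides, and asks the chart parser for a parse of the full token list. The helper name `rule_indices` and the example sequence are assumptions for illustration, and this is not the repository's encode() function.

```python
import nltk

# The RNA grammar from this thread, without the final `Nothing` rule.
RNA_GRAMMAR = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""

grammar = nltk.CFG.fromstring(RNA_GRAMMAR)
parser = nltk.ChartParser(grammar)

# Index every production so a parse can be turned into rule indices.
productions = grammar.productions()
prod_index = {prod: i for i, prod in enumerate(productions)}


def rule_indices(sequence):
    """Parse the whole sequence at once and return the indices of the rules used."""
    tokens = list(sequence)            # one token per nucleotide
    tree = next(parser.parse(tokens))  # first parse tree covering all tokens
    return [prod_index[p] for p in tree.productions()]


indices = rule_indices("GAUC")
print(indices)
```

Since this grammar can yield more than one parse tree for a sequence, taking the first tree is an arbitrary choice here. The resulting indices are what a rule-level one-hot encoding would be built from.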

nzarnaghi commented 3 years ago

Thank you so much for your reply. I checked the encode() code as well. Actually, I followed make_zinc_dataset_grammar.py. Some parts of encode() in molecule_vae.py are similar to make_zinc_dataset_grammar.py for making the one-hot vectors. Before running the parser, I tokenized the sequence, and I checked this part multiple times. I chose a sequence that could be tokenized in the same way by both the zinc tokenization and the RNA tokenization. The tokenized sequences were similar, but the zinc grammar could parse it while the RNA grammar could not. I guess the chart parser probably does not work for this grammar and I should look for another parser. What is your opinion?

nzarnaghi commented 3 years ago

The problem is solved. I think there was a problem with calling the grammar from another Python script, but it works now. Thank you so much.