Closed nzarnaghi closed 3 years ago
molecule_vae.py is a set of helper functions for building the grammar, with functions to encode sequences into grammar rules and decode them back. make_zinc_dataset_grammar.py makes a one-hot dataset of grammar rules from sequence data, whereas make_zinc_dataset_str.py makes a one-hot dataset from the characters of the sequence.
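To make the character-level variant concrete, here is a minimal sketch of one-hot encoding a sequence character by character. It is not the code from make_zinc_dataset_str.py: the RNA alphabet, the padding symbol, and the function name `one_hot_chars` are all assumptions chosen for illustration.

```python
# Hypothetical alphabet and padding symbol; the real scripts define their own.
ALPHABET = ['A', 'C', 'G', 'U']
PAD = ' '


def one_hot_chars(seq, max_len):
    """Return a max_len x (len(ALPHABET) + 1) one-hot matrix as nested lists."""
    symbols = ALPHABET + [PAD]
    index = {ch: i for i, ch in enumerate(symbols)}
    # Right-pad the sequence so every example has the same number of rows.
    padded = seq.ljust(max_len, PAD)
    rows = []
    for ch in padded:
        row = [0] * len(symbols)
        row[index[ch]] = 1
        rows.append(row)
    return rows


encoded = one_hot_chars("GAUC", max_len=6)
```

The grammar-level dataset has the same shape idea, except that each row marks which grammar rule was applied at that step instead of which character appeared.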
If you want to use a new grammar and input, I'd recommend creating a copy of molecule_vae.py and of all files that have the word 'zinc' in their name. The most important step will be creating new grammar rules. To respond to your email: the grammar needs to decode a sequence unambiguously. For instance, suppose you have these grammar rules: 'seq -> rna', 'rna -> A rna', 'rna -> C rna', 'rna -> AC rna', 'rna -> '
and you are trying to decode the sequence 'AC'. It is unclear whether the parser should use rules 1, 2, 3, 5 or rules 1, 4, 5. You have to make sure your rules do not overlap like this. See the equation and molecule grammars for inspiration.
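One way to check a grammar for this kind of overlap is to count the parse trees nltk finds for a test sequence. The sketch below rewrites the five example rules in nltk syntax, assuming the sequence is tokenized into single characters (so rule 4 becomes 'A' 'C' rna); the variable names are illustrative.

```python
import nltk

# The five example rules, in order. Terminals are quoted single characters,
# which assumes character-level tokenization. The last rule is empty.
ambiguous_grammar = nltk.CFG.fromstring("""
seq -> rna
rna -> 'A' rna
rna -> 'C' rna
rna -> 'A' 'C' rna
rna ->
""")

parser = nltk.ChartParser(ambiguous_grammar)

# Collect every parse of the whole token list, not just the first one.
trees = list(parser.parse(['A', 'C']))

# More than one tree means the grammar is ambiguous for this sequence.
print(len(trees))
for tree in trees:
    print(tree)
```

If the count is greater than one for any sequence in your data, the encoder has no principled way to pick a single rule sequence for it.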
Thank you so much for your help. The RNA sequence consists of the characters 'A', 'C', 'U', 'G'. I have written a grammar as below:
gram = """S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
Nothing -> Nones
"""
But it does not work. It only parses single characters and not the whole sequence. Could you please advise me on how to fix it?
The grammar won't parse because L and F share the same first rule ('A' F 'U'). Which rule is the parser supposed to choose?
The grammar is provided below:
Do you think that the code that I have written is wrong? The grammar is proposed in a paper.
Hmm, my guess is that the grammar would throw errors in the nltk library. Does it?
Even if it doesn't throw errors this sounds like a simple coding error: you need to pass the entire sequence to nltk. See encode() in molecule_vae.py for how this works in the molecule example. There are also nltk tutorials that may help.
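As a rough illustration of passing the entire sequence to nltk, the sketch below tokenizes an RNA string into a list of characters, parses the whole list in one call, and maps the productions of the first parse tree to rule indices. It follows the general idea of encode() but is not a copy of it: the grammar is the one posted in this thread with the unused 'Nothing' rule left out, and `prod_index` and `rule_ids` are hypothetical names.

```python
import nltk

# Grammar as posted in the thread, minus the unused 'Nothing' rule.
rna_grammar = nltk.CFG.fromstring("""
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
""")

# Give every production a fixed integer id (its position in the grammar).
productions = rna_grammar.productions()
prod_index = {prod: i for i, prod in enumerate(productions)}

parser = nltk.ChartParser(rna_grammar)

# Tokenize first, then hand the complete token list to the parser at once.
tokens = list("GAUC")
tree = next(iter(parser.parse(tokens)))

# The rule sequence of the first parse; each id selects one one-hot column.
rule_ids = [prod_index[p] for p in tree.productions()]
print(tree)
print(rule_ids)
```

Note that this grammar admits more than one tree for many sequences, so taking the first parse is an arbitrary choice here.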
Thank you so much for your reply. I checked the encode() code as well. I actually followed the code in make_zinc_dataset_grammar.py. Some parts of encode() in molecule_vae.py are similar to make_zinc_dataset_grammar.py in how they build the one-hot vectors. Before running the parser, I tokenized the sequence, and I checked this part multiple times. I chose a sequence that could be tokenized in a similar way by both the zinc tokenization and the RNA tokenization. The tokenized sequences were similar, but the zinc grammar could parse its sequence whereas the RNA grammar could not. I guess the chart parser probably does not work for this grammar and I should look for another parser. What is your opinion?
The problem is solved. I think there was a problem with calling the grammar from another Python script, but it works now. Thank you so much.
Hi,
I want to apply the code to RNA data with a different grammar. Could you please let me know what molecule_vae.py does? What is the difference between make_zinc_dataset_grammar.py and make_zinc_dataset_str.py? If I want to change the grammar and the input data, which parts of the code need to be changed?
Thank you