GenTxt closed this issue 5 years ago.
Sorry, this is super messy and all built around stuff I have downloaded. I actually don't have access to the machine the code is on right now, and won't for a few weeks, so I don't have the exact files. I do know roughly what their contents are, though:

`tokenize_sent.sh` is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`. I might also have the `-preserveLines` option, but I'm not sure if I'm using it.

`detokenize_sent.sh` is pretty much `java edu.stanford.nlp.process.DocumentPreprocessor /tmp/in.txt > /tmp/out.txt`.

I put both of those files in the same folder as the parser to simplify things. I don't use Java much, but I probably should've added it to the classpath or something. If you aren't able to get that to work, you could probably swap in a different sentence tokenizer without much of a difference.
Good luck on getting it to work, and let me know how it goes!
Thanks for the quick reply and solution. Hopefully only one more question, as described below.

Have the parser and required .sh files in `Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh` (and `detokenize.sh`). Runs without error in the revised Jupyter notebook.

Changed the location of the books to `prefix = 'Textfiles/sources/'` (same program folder containing `Datasets`). Placed the renamed text files `JaneAustenNorthanger_Abbey.txt` and `Sir_Arthur_Conan_Doyle` in the `sources` subfolder. Also changed some of the code in the notebook to remove spaces in output names (running Ubuntu 18). Continuing to run the notebook generates a `FileNotFoundError` at this cell:

changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)
FileNotFoundError Traceback (most recent call last)
So the `Datasets` folder is inside this project's directory?

Also, I think you'll have problems with the `get_corpus('Sir_Arthur_Conan_Doyle')` call. The `get_corpus()` method assumes that you have the Project Gutenberg dataset from https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html downloaded and unzipped into the prefix directory. If you're using other files, it should still work, but you'll have to replace the `get_corpus` call with something like `open(FILE_NAME).read()`.
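To make that substitution concrete, here's a minimal sketch of swapping `get_corpus()` for a plain file read. The `read_corpus` helper and the demo file name are my own illustrations, not part of the repo:

```python
# Stand-in for get_corpus() when you aren't using the Project Gutenberg
# dataset: just read the whole text file yourself.
def read_corpus(path):
    """Return the full text of a plain-text corpus file as one string."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Throwaway demo file so this sketch runs on its own.
with open("demo_corpus.txt", "w", encoding="utf-8") as f:
    f.write("It was the best of times.")

text = read_corpus("demo_corpus.txt")
print(text)  # -> It was the best of times.
```

In the notebook, you'd point `read_corpus` at whatever file sits in your `prefix` directory instead of passing an author name.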
Yes.
SentenceEmbeddings/SentenceChange.ipynb (revised notebook)
SentenceEmbeddings/Infersent/dataset (model)
SentenceEmbeddings/Infersent/encoder (.pkl)
SentenceEmbeddings/Datasets/stanford-parser-full-2017-06-09 (parser and .sh files)
SentenceEmbeddings/Datasets/Textfiles/sources (revised location and named text files)
Downloading PG archive and will install as per your advice and make changes as necessary.
Thanks
Hi: Have Gutenberg set up and am running the notebook with the original `get_corpus` call. Changed a few directory locations, for example all instances of `/tmp/in.txt` and `/tmp/out.txt` to `Datasets/tmp/in.txt` and `Datasets/tmp/out.txt`. Appears to be working until the error below:

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'

It writes `in.txt` (processed Northanger Abbey) to `Datasets/tmp/in.txt` but never produces `out.txt`.

I would appreciate any suggestions on how to fix this. Thanks
FileNotFoundError Traceback (most recent call last)
I know it's some issue with the file paths in the sentence tokenizer, but I'm not sure exactly what. I would try changing `tokenize_sent.sh` to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt`, and `detokenize_sent.sh` to `java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt`.
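If it helps, the underlying gotcha (as I understand it) is that relative paths inside a shell script resolve against the caller's working directory, not the script's own folder. A quick Python illustration, with directory names taken from this thread:

```python
import os

def resolve(relative_path, cwd):
    """Show where a relative path actually lands when the process runs from `cwd`."""
    return os.path.normpath(os.path.join(cwd, relative_path))

# Run from inside the parser folder, ../tmp/in.txt climbs back out to Datasets/tmp:
print(resolve("../tmp/in.txt", "Datasets/stanford-parser-full-2017-06-09"))
# -> Datasets/tmp/in.txt

# An absolute path ignores the working directory entirely:
print(resolve("/tmp/in.txt", "Datasets/stanford-parser-full-2017-06-09"))
# -> /tmp/in.txt
```

So whether `../tmp/in.txt` works depends entirely on which directory the notebook invokes the script from.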
Thanks for the reply. Unfortunately it's the same error. I'll close this for now and search for possible solutions.
Cheers
Were you ever able to get it to work? I have access to the machine that I originally developed this on now, so I might be able to help you more, if you want.
Hi:
Thanks for getting back to me. Same errors as before. Have tried numerous `>` redirect combinations, but they generate the original error or a new one. Posted a "please help" on Stack Overflow but no solution.

Can you post the original tokenize.sh and detokenize.sh?

Any help is appreciated.
Wow, looks like there are several differences from what I remembered. You'll still have to change the `/tmp/in.txt` paths to wherever your `tmp` directory is relative to the parser directory. But, other than that, you should be able to use the same files.

tokenize.sh:

export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt

detokenize.sh:

export CLASSPATH=$(dirname $0)/stanford-parser.jar
java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
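For anyone wiring these up from Python, this is roughly how a notebook might drive the scripts: write the input file, run the script, read the output file. The `run_script` helper is my own sketch, and the demo uses a stand-in `tr`-based "tokenizer" because the real scripts need `stanford-parser.jar` next to them:

```python
import pathlib
import subprocess

def run_script(text, script, in_path="tmp/in.txt", out_path="tmp/out.txt"):
    """Write `text` to in_path, run the shell script, return out_path's contents."""
    pathlib.Path(in_path).parent.mkdir(parents=True, exist_ok=True)
    pathlib.Path(in_path).write_text(text, encoding="utf-8")
    # The working directory matters: the scripts locate stanford-parser.jar via
    # $(dirname $0) and read/write their tmp paths relative to where they run.
    subprocess.run(["bash", script], check=True)
    return pathlib.Path(out_path).read_text(encoding="utf-8")

# Stand-in script (splits on spaces) so this demo runs without the Stanford jar:
pathlib.Path("fake_tok.sh").write_text("tr ' ' '\\n' < tmp/in.txt > tmp/out.txt\n")
print(run_script("hello world", "fake_tok.sh"))  # -> hello\nworld
```

With the real scripts, you'd pass the path to `tokenize.sh` or `detokenize.sh` and make sure `in_path`/`out_path` match whatever paths the script hardcodes.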
Works like a charm now with PG format text files. Thanks.
Testing unwrapped-line texts, but that generates out-of-memory errors on my GTX 1070. Will test Java memory settings.
Cheers
Thanks for the interesting repo.
I've downloaded the specified version of the Stanford parser, but I can't find the `stanford-parser-full-2017-06-09/tokenize_sent.sh` and `detokenize.sh` files required by the notebook. Are these renamed files in the parser folder? If not, is it possible to upload them to this repo or provide a link?
Thanks,