superMDguy / nanogenmo-2018

Work for 2018 NaNoGenMo
MIT License
1 star · 0 forks

Missing tokenize_sent.sh and detokenize.sh? #1

Closed · GenTxt closed this issue 5 years ago

GenTxt commented 5 years ago

Thanks for the interesting repo.

I've downloaded the specified version of the Stanford parser, but I can't find "stanford-parser-full-2017-06-09/tokenize_sent.sh" and "detokenize.sh" required in the notebook.

Are these renamed files in the parser folder? If not, is it possible to upload these to this repo or provide a link?

Thanks,

superMDguy commented 5 years ago

Sorry, this is super messy, and all built around stuff I have downloaded. I actually don't have access to the machine I have the code on right now, and won't for a few weeks, so I don't have the exact files. I do know roughly what their contents are though:

I put both of those files in the same folder as the parser to simplify things. I don't use Java much, but I probably should've added it to the classpath or something. If you aren't able to get that to work, you could probably swap in a different sentence tokenizer without much of a difference.

Good luck on getting it to work, and let me know how it goes!

GenTxt commented 5 years ago

Thanks for the quick reply and solution. Hopefully only one more issue, described below.

Have the parser and required .sh files in Datasets/stanford-parser-full-2017-06-09/ (tokenize_sent.sh and detokenize.sh). The revised Jupyter notebook runs without error.

Changed the location of the books to prefix = 'Textfiles/sources/' (in the same program folder containing 'Datasets'), and placed the renamed text files 'JaneAustenNorthanger_Abbey.txt' and 'Sir_Arthur_Conan_Doyle' in the 'sources' subfolder. Also changed some of the code in the notebook to remove spaces in the output names (running Ubuntu 18). Continuing to run the notebook generates this FileNotFoundError:

changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)


FileNotFoundError Traceback (most recent call last)
in
----> 1 changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(), get_corpus('Sir_Arthur_Conan_Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir_Arthur_Conan_Doyle's_Northanger_Abbey", changed)

FileNotFoundError: [Errno 2] No such file or directory: 'Textfiles/sources/JaneAustenNorthanger_Abbey.txt'

Changed to prefix = 'Datasets/Textfiles/sources/' (similar location logic as the original). Same error:

No such file or directory: 'Datasets/Textfiles/sources/JaneAustenNorthanger_Abbey.txt'

Not sure why the notebook isn't finding the file. I'm new to Python and would appreciate knowing how to fix this for future projects. Thanks
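When a relative path like this fails, the usual culprit is the working directory: Jupyter resolves relative paths against the directory the notebook server was started in, not necessarily the notebook's own folder. A quick diagnostic sketch (the prefix and filename below follow this thread's layout):

```python
import os

# Relative paths are resolved against the current working directory,
# which for Jupyter is where the server was launched from.
prefix = 'Textfiles/sources/'
path = prefix + 'JaneAustenNorthanger_Abbey.txt'

print('working directory:', os.getcwd())
print('looking for:', os.path.abspath(path))   # the full path Python actually opens
print('exists:', os.path.exists(path))         # True only if the file is really there
```

Comparing the absolute path printed here against where the file actually sits usually makes the mismatch obvious.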
superMDguy commented 5 years ago

So the Datasets folder is inside this project's directory?

Also, I think you'll have problems with the get_corpus('Sir_Arthur_Conan_Doyle') call. The get_corpus() method assumes that you have the Project Gutenberg dataset from https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html downloaded and unzipped into the prefix directory. If you're using other files, it should still work, but you'll have to replace the get_corpus call with something like open(FILE_NAME).read().
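That substitution can be sketched as below. The helper name read_corpus and the example file path are placeholders, not part of the repo; the point is simply that any plain-text source file can stand in for the Gutenberg dataset lookup:

```python
def read_corpus(file_name):
    # Plain replacement for get_corpus() when you're not using the
    # Gutenberg dataset: read a single source text file directly.
    with open(file_name, encoding='utf-8') as f:
        return f.read()

# e.g. instead of get_corpus('Sir_Arthur_Conan_Doyle'), something like:
# changed = change_book(open(prefix + 'JaneAustenNorthanger_Abbey.txt').read(),
#                       read_corpus('Textfiles/sources/Sir_Arthur_Conan_Doyle.txt'))
```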

GenTxt commented 5 years ago

Yes.

SentenceEmbeddings/SentenceChange.ipynb (revised notebook)

SentenceEmbeddings/Infersent/dataset (model)

SentenceEmbeddings/Infersent/encoder (.pkl)

SentenceEmbeddings/Datasets/stanford-parser-full-2017-06-09 (parser and .sh files)

SentenceEmbeddings/Datasets/Textfiles/sources (revised location and named text files)

Downloading PG archive and will install as per your advice and make changes as necessary.

Thanks


GenTxt commented 5 years ago

Hi: Have the Gutenberg dataset set up and am running the notebook with the original get_corpus call. Changed a few directory locations, for example all instances of '/tmp/in.txt' and '/tmp/out.txt' to 'Datasets/tmp/in.txt' and 'Datasets/tmp/out.txt'. Appears to be working until the error below.

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'

Writes 'in.txt' (processed Northanger Abbey) to 'Datasets/tmp/in.txt' but not 'out.txt'.

I would appreciate any suggestions on how to fix. Thanks


FileNotFoundError Traceback (most recent call last)
in
----> 1 changed = change_book(open(prefix + 'Jane Austen___Northanger Abbey.txt').read(), get_corpus('Sir Arthur Conan Doyle'))
      2 write_file('Northanger_Abbey_x_Doyle_with_translation', "Sir Arthur Conan Doyle's Northanger Abbey", changed)

in change_book(toChange, source, withTranslation, useAnnoy, maxChars)
      1 def change_book(toChange, source, withTranslation = True, useAnnoy = False, maxChars = 5000000):
----> 2     toChangeSent = tokenize_sentences(toChange)
      3     sourceSent = tokenize_sentences(source[:maxChars])
      4
      5     model.build_vocab(toChangeSent + sourceSent, tokenize=True)

in tokenize_sentences(text)
      3     open('Datasets/tmp/in.txt', 'w').write(text.replace('\n\n', NEWLINE))
      4     os.system('Datasets/stanford-parser-full-2017-06-09/tokenize_sent.sh')
----> 5     tokens = open('Datasets/tmp/out.txt').read().split('\n')
      6     print('Total tokens in dataset', len(tokens))
      7

FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/tmp/out.txt'
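One thing the traceback hides: os.system() ignores the script's exit status, so if tokenize_sent.sh fails (wrong classpath, wrong relative path, not executable), the notebook silently moves on and only crashes later when out.txt is missing. A hedged sketch of a stricter wrapper (run_script_checked is a hypothetical helper, not part of the repo):

```python
import subprocess

def run_script_checked(script_path):
    """Run a shell script and raise if it fails, surfacing stderr.

    os.system() in the notebook discards the exit status, so a broken
    tokenize_sent.sh silently produces no out.txt. Checking the return
    code makes the real error visible.
    """
    result = subprocess.run(['sh', script_path], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'{script_path} failed: {result.stderr.strip()}')
    return result.stdout
```

Replacing the os.system(...) call in tokenize_sentences with run_script_checked(...) should turn the misleading FileNotFoundError into the tokenizer's actual error message.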
superMDguy commented 5 years ago

I know it's some issue with the file paths in the sentence tokenizer, but I'm not sure exactly what. I would try changing both tokenize_sent.sh and detokenize_sent.sh to:

java edu.stanford.nlp.process.DocumentPreprocessor ../tmp/in.txt > ../tmp/out.txt

GenTxt commented 5 years ago

Thanks for the reply. Unfortunately it's the same error. I'll close this for now and search for possible solutions.

Cheers

superMDguy commented 5 years ago

Were you ever able to get it to work? I have access to the machine that I originally developed this on now, so I might be able to help you more, if you want.

GenTxt commented 5 years ago

Hi:

Thanks for getting back to me. Same errors as before. Have tried numerous '>' redirect combinations, but they generate the original error or new ones. Posted a "please help" on Stack Overflow, but no solution.

Can you post the original tokenize.sh and detokenize.sh?

Any help is appreciated.


superMDguy commented 5 years ago

Wow, looks like there are several differences from what I remembered. You'll still have to change the /tmp/in.txt paths to wherever your tmp directory is relative to the parser directory. But, other than that, you should be able to use the same files.

tokenize.sh:

export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -preserveLines /tmp/in.txt > /tmp/out.txt

detokenize.sh:

export CLASSPATH=$(dirname $0)/stanford-parser.jar

java edu.stanford.nlp.process.PTBTokenizer -untok /tmp/in.txt > /tmp/out.txt
GenTxt commented 5 years ago

Works like a charm now with PG format text files. Thanks.

Testing texts with unwrapped lines, but that generates out-of-memory errors on my GTX 1070. Will test Java memory settings.

Cheers
