pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

How can I parse sequence of sentences #685

Open antgr opened 4 years ago

antgr commented 4 years ago

❓ Questions and Help

Description

1 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
2 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
etc.

How could I read this file? Any suggestions? I want to do sequence labeling at the sentence level.
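For reference, a file in exactly this layout could be read by hand with the standard library. This is only a minimal sketch under assumptions that the post does not confirm: each line is an ID, then a Python-style list of labels, then a Python-style list of sentences, and no label contains a "]" character. The name parse_line and the sample text are illustrative only.

```python
import ast

# Illustrative sample in the layout from the question (elisions removed).
SAMPLE = """1 ['lab1', 'lab2'] ['this is sentence 1', 'this is sentence 2']
2 ['lab1', 'lab2'] ['this is sentence 1', 'this is sentence 2']"""

def parse_line(line):
    """Split one line into (doc_id, [(label, sentence), ...])."""
    doc_id, rest = line.strip().split(" ", 1)
    # Assumption: labels never contain "]", so the first "]" closes the label list.
    end = rest.index("]") + 1
    labels = ast.literal_eval(rest[:end])
    sentences = ast.literal_eval(rest[end:].strip())
    if len(labels) != len(sentences):
        raise ValueError(f"document {doc_id}: label/sentence count mismatch")
    return doc_id, list(zip(labels, sentences))

parsed = [parse_line(line) for line in SAMPLE.splitlines()]
```

Each entry of parsed keeps the sentences of one document together, in order, which is what sentence-level sequence labeling needs.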

icmpnorequest commented 4 years ago

First, torchtext can only read three file formats: csv, tsv and json. Before using torchtext, make sure your file is saved in one of these three formats.

Second, for a sequence labeling task you could lay the data out as follows (csv format):

# For sentence 1
lab1,word1
lab2,word2
...
labn,wordn

# For sentence 2
lab1,word1
lab2,word2
...

If you pre-process the data file into this format, there are two columns: label and word. You can then use data.Field() and data.LabelField() to load the data.

Hope this helps.

antgr commented 4 years ago

Hi @icmpnorequest
thanks for your answer. What you suggest looks good for sequence labeling at the word level, where each word has its own label. Here I have a label for an entire sentence, so to be honest I don't think your answer helps me much. But thanks a lot!

icmpnorequest commented 4 years ago

Hi @antgr, sorry, I missed the sentence-level part before.

1 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
2 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]

By the way, what do the 1 and 2 in the first column represent? Do you want to do something like text classification at the sentence level?

antgr commented 4 years ago

You can see here an example of this use case: https://raw.githubusercontent.com/titipata/detecting-scientific-claim/master/dataset/train_labels.json

"labels": ["0", "0", "0", "0", "0", "1", "1"], "sentences": ["Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages.", "However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood.", "Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa.", "Here, we collected viable sperm from first-generation hybrid male progeny of Mus musculus castaneus and M. m. domesticus, two subspecies of rodent in the earliest stages of speciation.", "We then genotyped millions of single nucleotide polymorphisms in these gamete pools and tested for a skew in the frequency of parental alleles across the genome.", "We show that segregation distorters are not measurable contributors to observed infertility in these hybrid males, despite sufficient statistical power to detect even weak segregation distortion with our novel method.", "Thus, reduced hybrid male fertility in crosses between these nascent species is attributable to other evolutionary forces."],

@icmpnorequest what do you think?

icmpnorequest commented 4 years ago

@antgr What about keeping three columns: paper_id, label, sentence?

Take paper_id 26121240 as an example (save as a csv file):

paper_id,label,sentence
26121240,0,"Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages."
26121240,0,"However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood."
26121240,0,"Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa."

The paper_id column is kept so you can tell which sentences come from the same paper. Then extract each sentence from the sentences array and save it in its own row. If you want to do the sequence labeling task, this is probably OK, but I don't know whether it suits a text summarization/generation task.

What do you think about it?
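The conversion described above could be sketched with the standard library as follows. This is a hedged example, not the dataset's official loader: the key name "paper_id" in the JSON record is an assumption (only "labels" and "sentences" appear in the excerpt), and record_to_rows and write_csv are illustrative names. Because the sentences contain commas, csv.writer is used so that those fields get quoted automatically.

```python
import csv
import io
import json

# One record shaped like the excerpt above, shortened to two sentences.
# Assumption: the real file stores the ID under a "paper_id" key.
RECORD_JSON = json.dumps({
    "paper_id": "26121240",
    "labels": ["0", "1"],
    "sentences": [
        "However, the evolutionary forces that lead to the accumulation of such "
        "incompatibilities between diverging taxa are poorly understood.",
        "Thus, reduced hybrid male fertility in crosses between these nascent "
        "species is attributable to other evolutionary forces.",
    ],
})

def record_to_rows(record):
    """Turn one paper into (paper_id, label, sentence) rows, one per sentence."""
    labels, sentences = record["labels"], record["sentences"]
    if len(labels) != len(sentences):
        raise ValueError("labels and sentences differ in length")
    return [(record["paper_id"], lab, sent) for lab, sent in zip(labels, sentences)]

def write_csv(records, handle):
    """Write a header plus one row per sentence; commas in text get quoted."""
    writer = csv.writer(handle)
    writer.writerow(["paper_id", "label", "sentence"])
    for record in records:
        writer.writerows(record_to_rows(record))

# In-memory demo; for a real file use open(path, "w", newline="").
buffer = io.StringIO()
write_csv([json.loads(RECORD_JSON)], buffer)
csv_text = buffer.getvalue()
```

Reading csv_text back with csv.reader returns each sentence as a single field, commas included.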

antgr commented 4 years ago

@icmpnorequest I am OK with the paper_id,label,sentence format.

Let's first agree that the best option is to treat the problem as a sequence labeling task over sentences. This way we take into account what kind of sentence came before and what kind of sentence follows, which can hint at what kind of information the current sentence captures. That is what we actually want to learn. Now that we agree on attacking the problem as sequence labeling (at the sentence level), I would like to understand how this could be implemented. Can I use torchtext to read the dataset?

paper_id,label,sentence
26121240,0,"Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages."
26121240,0,"However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood."
26121240,0,"Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa."

How could I do it? Should I inherit from one of the torchtext classes? Or do you think I should skip torchtext and do the parsing manually?
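For the manual route, one possible sketch is to read the paper_id,label,sentence rows and regroup them per paper, so that each paper becomes one ordered sequence of sentences with a parallel sequence of labels. It uses only the standard library; group_by_paper is an illustrative name, and the "paper-B" row is a made-up placeholder added just to show a second group.

```python
import csv
import io

# Sample rows in the paper_id,label,sentence layout. The first paper reuses
# sentences quoted in this thread; "paper-B" is a made-up placeholder.
CSV_TEXT = '''paper_id,label,sentence
26121240,0,"However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood."
26121240,1,"Thus, reduced hybrid male fertility in crosses between these nascent species is attributable to other evolutionary forces."
paper-B,0,"placeholder sentence"
'''

def group_by_paper(rows):
    """Collect rows into {paper_id: {"labels": [...], "sentences": [...]}}.

    Row order is preserved, so each paper keeps its original sentence order.
    """
    papers = {}
    for row in rows:
        doc = papers.setdefault(row["paper_id"], {"labels": [], "sentences": []})
        doc["labels"].append(row["label"])
        doc["sentences"].append(row["sentence"])
    return papers

papers = group_by_paper(csv.DictReader(io.StringIO(CSV_TEXT)))
```

Each value in papers is then one training example for a sentence-level tagger: a list of sentences and a label list of the same length.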

icmpnorequest commented 4 years ago

Hi @antgr, torchtext builds batches from the text dataset for you, so you don't have to transform the text into batches manually. I think it's one of the most suitable tools for text classification tasks. If you want to capture the hints between sentences, why not use some pre-trained vectors/model to embed them?

Here is a pipeline with torchtext using a custom dataset for the sequence labeling task:

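import torch
from torchtext import data  # in torchtext 0.9-0.11 use: from torchtext.legacy import data

# Placeholder settings (not from the original post); adjust to your setup
dataset_path = "data"
BATCH_SIZE = 32
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. data.Field()
PAPERID = data.Field(sequential=False)  # one ID per row, so no tokenization
LABEL = data.LabelField()
TEXT = data.Field(tokenize='spacy')     # requires spaCy and its English model

# 2. data.TabularDataset
# Note: the keyword for the validation file is `validation`, not `valid`.
# Fields are matched to csv columns by position; skip_header drops the header row.
train_data, valid_data, test_data = data.TabularDataset.splits(path=dataset_path,
                                                   train="train.csv",
                                                   validation="valid.csv",
                                                   test="test.csv",
                                                   fields=[('paper_id', PAPERID), ('label', LABEL), ('text', TEXT)],
                                                   format="csv",
                                                   skip_header=True)

# 3. data.BucketIterator
train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, valid_data, test_data),
                                                               batch_size=BATCH_SIZE,
                                                               device=device,
                                                               sort_key=lambda x: len(x.text))

# 4. Build vocab (must be done before iterating over the batches)
PAPERID.build_vocab(train_data)
TEXT.build_vocab(train_data)
# If you want to use pre-trained vectors, like GloVe, you could instead call
# TEXT.build_vocab(train_data, vectors="glove.6B.300d")

LABEL.build_vocab(train_data)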