Open antgr opened 4 years ago
Firstly, torchtext can only deal with three file formats: csv, tsv and json. Before using torchtext, make sure your file is saved in one of these three formats.
Secondly, for a sequence labeling task you could lay out the data as follows (csv format):
# For sentence 1
lab1,word1
lab2,word2
...
labn,wordn
# For sentence 2
lab1,word1
lab2,word2
...
If you pre-process the data file into this format, there are two columns: label and word. You could use data.Field() and data.LabelField() to load the data.
Hope this helps.
Hi @icmpnorequest
Thanks for your answer. What you suggest looks good for a sequence labeling task at the word level, where each word has its own label.
Here I have a label for an entire sentence.
So to be honest, I think that your answer does not help me much.
But thanks a lot!
Hi @antgr
Sorry, I didn't catch the sentence level before.
1 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
2 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
By the way, what do the 1 and 2 in the first column represent? You wanna do some task like text classification at the sentence level?
You can see here an example of this use case: https://raw.githubusercontent.com/titipata/detecting-scientific-claim/master/dataset/train_labels.json
"labels": ["0", "0", "0", "0", "0", "1", "1"], "sentences": [**"Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages."**, "However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood.", **"Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa."**, "Here, we collected viable sperm from first-generation hybrid male progeny of Mus musculus castaneus and M. m. domesticus, two subspecies of rodent in the earliest stages of speciation.", **"We then genotyped millions of single nucleotide polymorphisms in these gamete pools and tested for a skew in the frequency of parental alleles across the genome."**, "We show that segregation distorters are not measurable contributors to observed infertility in these hybrid males, despite sufficient statistical power to detect even weak segregation distortion with our novel method.", **"Thus, reduced hybrid male fertility in crosses between these nascent species is attributable to other evolutionary forces."**],
@icmpnorequest what do you think?
@antgr
What about keeping three columns: paper_id, label, sentence?
Take paper_id 26121240 as an example (saved as a csv file):
paper_id,label,sentence
26121240,0,"Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages."
26121240,0,"However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood."
26121240,0,"Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa."
The paper_id column is kept to tell whether sentences come from the same paper. Then, extract each sentence from the sentences field and save it as its own row. Since the sentences contain commas, the sentence column has to be quoted in the csv file. If you wanna do the sequence labeling task, maybe it's OK, but I don't know if it fits a text summarization/generation task.
What do you think about it?
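The conversion suggested above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not the dataset's official tooling: the "paper_id" key and the sample records are assumptions made for illustration, while "labels" and "sentences" follow the JSON excerpt quoted earlier. csv.writer quotes any field that contains a comma, which matters here because most sentences do.

```python
import csv
import io
import json

# Hypothetical sample in the "labels"/"sentences" layout quoted in the thread.
# The "paper_id" key is an assumption; check the real file for the actual key.
raw = json.dumps([
    {
        "paper_id": "26121240",
        "labels": ["0", "1"],
        "sentences": ["First sentence, with a comma.", "Second sentence."],
    }
])


def json_to_rows(records):
    """Yield one (paper_id, label, sentence) tuple per sentence."""
    for record in records:
        for label, sentence in zip(record["labels"], record["sentences"]):
            yield record["paper_id"], label, sentence


def write_csv(records, handle):
    """Write a header plus one row per sentence to an open text handle."""
    writer = csv.writer(handle)  # default quoting wraps fields containing commas
    writer.writerow(["paper_id", "label", "sentence"])
    writer.writerows(json_to_rows(records))


buffer = io.StringIO()
write_csv(json.loads(raw), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="") to write_csv.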
@icmpnorequest I am OK with the paper_id,label,sentence format.
Let's first agree that the best option is to treat the problem as a sequence labeling task over sentences. This way, we take into consideration what kind of sentence preceded and what kind of sentence follows. That context can give a hint about what kind of information the current sentence captures, which is what we actually want to learn. Now that we have agreed to attack the problem as a sequence labeling task (at the sentence level), I would like to understand how this could be implemented. Can I use torchtext to read the dataset?
paper_id,label,sentence
26121240,0,"Understanding the molecular basis of species formation is an important goal in evolutionary genetics, and Dobzhansky-Muller incompatibilities are thought to be a common source of postzygotic reproductive isolation between closely related lineages."
26121240,0,"However, the evolutionary forces that lead to the accumulation of such incompatibilities between diverging taxa are poorly understood."
26121240,0,"Segregation distorters are believed to be an important source of Dobzhansky-Muller incompatibilities between hybridizing species of Drosophila as well as hybridizing crop plants, but it remains unclear if these selfish genetic elements contribute to reproductive isolation in other taxa."
How could I do it? Should I inherit from one of the torchtext classes? Or do you think that I should not use torchtext and do the parsing manually?
Hi @antgr,
torchtext helps to build the text dataset into batches, instead of transforming the text into batches manually. I think it's one of the most suitable tools for text classification tasks. If you wanna capture the hints between sentences, why not use some pre-trained vectors/model to embed them?
Here is a pipeline with torchtext using a custom dataset for the sequence labeling task:
# Legacy torchtext API. From torchtext 0.9 these classes moved to
# torchtext.legacy.data, and later releases removed them.
# dataset_path, BATCH_SIZE and device are assumed to be defined beforehand.
from torchtext import data

# 1. data.Field()
PAPERID = data.Field()
LABEL = data.LabelField()
TEXT = data.Field(tokenize='spacy')

# 2. data.TabularDataset
train_data, valid_data, test_data = data.TabularDataset.splits(
    path=dataset_path,
    train="train.csv",
    validation="valid.csv",  # the keyword is `validation`, not `valid`
    test="test.csv",
    format="csv",
    skip_header=True,  # the csv files start with a header row
    fields=[('paper_id', PAPERID), ('label', LABEL), ('text', TEXT)])

# 3. data.BucketIterator
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.text))

# 4. Build vocab
PAPERID.build_vocab(train_data)
TEXT.build_vocab(train_data)
# If you wanna use some pre-trained vectors, like GloVe, you could use instead:
# TEXT.build_vocab(train_data, vectors="glove.6B.300d")
LABEL.build_vocab(train_data)
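One caveat with the pipeline above: each csv row becomes its own example, so sentences from the same paper are not kept together in a batch. If the model needs the whole sentence sequence of a paper, one option is to group the rows by paper_id before batching. The following is only a sketch of that grouping step with the standard library, on hypothetical sample rows; it is not part of torchtext, and group_by_paper is a made-up helper name.

```python
import csv
import io

# Hypothetical rows in the paper_id,label,sentence layout discussed above.
csv_text = '''paper_id,label,sentence
p1,0,"First sentence, with a comma."
p1,1,Second sentence.
p2,0,Only sentence.
'''


def group_by_paper(handle):
    """Return {paper_id: (labels, sentences)}, keeping the row order."""
    papers = {}
    for row in csv.DictReader(handle):
        labels, sentences = papers.setdefault(row["paper_id"], ([], []))
        labels.append(row["label"])
        sentences.append(row["sentence"])
    return papers


papers = group_by_paper(io.StringIO(csv_text))
for paper_id, (labels, sentences) in papers.items():
    print(paper_id, labels, len(sentences))
```

Each entry then holds one label sequence and one sentence sequence per paper, which is the shape a sentence-level sequence labeler would consume.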
❓ Questions and Help
Description
1 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
2 ['lab1', 'lab2',...] ['this is sentence 1', 'this is sentence 2',...]
etc
How could I read this file? Any suggestion? I want to do sequence labeling on sentence level.
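If the file really is laid out as in the description, with an id followed by two Python-style lists on each line, it could be parsed by hand along these lines. This is a sketch under that assumption only; parse_line and the regular expression are hypothetical, and the pattern assumes that labels never contain square brackets.

```python
import ast
import re

# Assumed layout: "<id> <list of labels> <list of sentences>" on one line.
LINE_RE = re.compile(r"^(\S+)\s+(\[.*?\])\s+(\[.*\])\s*$")


def parse_line(line):
    """Split one line into (doc_id, labels, sentences)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected line layout: {line!r}")
    doc_id, raw_labels, raw_sentences = match.groups()
    # literal_eval only evaluates literals, so it is safe on untrusted text.
    labels = ast.literal_eval(raw_labels)
    sentences = ast.literal_eval(raw_sentences)
    if len(labels) != len(sentences):
        raise ValueError(f"label/sentence count mismatch for id {doc_id}")
    return doc_id, labels, sentences


sample = "1 ['lab1', 'lab2'] ['this is sentence 1', 'this is sentence 2']"
doc_id, labels, sentences = parse_line(sample)
print(doc_id, list(zip(labels, sentences)))
```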