tahmedge / QR-BERTSUM-TL-for-QFAS

This is the official repository of the paper "Query Focused Abstractive Summarization via Incorporating Query Relevance and Transfer Learning with Transformer Models" accepted at Canadian AI 2020.

Preproc - format_to_bert step #4

Closed ahmed-moubtahij closed 3 years ago

ahmed-moubtahij commented 3 years ago

BertData::preprocess in data_builder.py has this check:

if ((not is_test) and len(src) < self.args.min_src_nsents):
    return None

When running the last preproc step for a given fold, len(src) evaluates to 2 and self.args.min_src_nsents has a default value of 3. The result is that I end up with invalid dp.train.#.bert.pt and dp.valid.#.bert.pt files (only the dp.test.0.bert.pt file is valid).

This problem is fixed if I assign 2 to the -min_src_nsents argument, but I'm not sure if this is the right solution. If it is, would you happen to have the list of parameters for the preproc steps?
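
For context, the workaround is just lowering that threshold at the format_to_bert step, roughly like this (a sketch assuming the PreSumm-style src/preprocess.py entry point; only -min_src_nsents and format_to_bert come from this issue, the path flags are placeholders for my own paths):

python src/preprocess.py -mode format_to_bert -raw_path JSON_DIR -save_path BERT_DATA_DIR -min_src_nsents 2  # JSON_DIR / BERT_DATA_DIR are placeholders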

tahmedge commented 3 years ago

Hi, I am a bit confused about what you are asking for. Can you please elaborate on what you mean by the parameters for the pre-processing steps?

ahmed-moubtahij commented 3 years ago

That issue seems to be a symptom of another one; I noticed that after this line in data_builder.py/BertData::preprocess():

src_subtokens = [self.cls_token] + src_subtokens + [self.sep_token]

I get this kind of result (with print(' '.join(src_subtokens))):

[CLS] < ##s ##> virtues : is debate better for the emotions spirit virtues ? < ##eo ##s ##> [SEP] [CLS] < ##s ##> it is not a virtue to appeal to emotions . this should not ex ##cite us . calm civil ##ity is a greater virtue than excited debate . dialogue is the best way to adhere to these virtues while approaching difficult problems . < ##eo ##s ##> [SEP]

The preprocessing steps, applied as is, don't separate the sentences in the source document, so BERTSUM's input items end up consisting of only two segments: [CLS]Q[SEP] [CLS]src_doc[SEP]. In other words, all the sentences of the source document sit inside a single [CLS]...[SEP] pair. This doesn't seem to match the paper's description: [figure from the paper]. Am I missing something? If not, how could I correct the preprocessing?
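
For concreteness, here is a small sketch of the layout I would expect (assuming BERTSUM-style per-sentence [CLS]...[SEP] segmentation, which is my reading of the paper) versus what I am actually getting; this is illustration only, not the repo's code:

# Illustration only: expected vs. observed token layout for a query and a
# source document split into sentences.
query = "virtues : is debate better for the emotions spirit virtues ?"
sentences = [
    "it is not a virtue to appeal to emotions .",
    "calm civility is a greater virtue than excited debate .",
]

# Expected (per-sentence segmentation): [CLS] Q [SEP] [CLS] s1 [SEP] [CLS] s2 [SEP] ...
expected = ["[CLS]", query, "[SEP]"]
for sent in sentences:
    expected += ["[CLS]", sent, "[SEP]"]

# Observed with the current preprocessing (whole document treated as one sentence):
observed = ["[CLS]", query, "[SEP]", "[CLS]", " ".join(sentences), "[SEP]"]

print(" ".join(expected))
print(" ".join(observed))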

tahmedge commented 3 years ago

It depends on how you constructed your "CNN-DM style story" files. If you prepared your story files by putting a whole document on one single line, then that whole line gets treated as a single sentence.

You may need to modify this command so that it makes use of the sentence splits in your dataset.

An example of how you should prepare your "cnn-dm formatted story dataset" is the following:

<s> QUERY <eos>

<s> SENTENCE 1 <eos>

<s> SENTENCE 2 <eos>

<s> SENTENCE 3 <eos>

@highlight

<s> SUMMARY <eos>
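
A rough sketch of a script that emits the layout above (purely illustrative, not our preprocessing code; the <s>/<eos> wrapping mirrors the Debatepedia release, and the file name and helper are my own):

# Illustration only: write one example to a CNN-DM style .story file in the
# layout shown above. Adapt the file name and sentence splitting to your data.
def write_story(path, query, sentences, summary):
    with open(path, "w", encoding="utf-8") as f:
        f.write("<s> " + query + " <eos>\n\n")      # query as the first line
        for sent in sentences:
            f.write("<s> " + sent + " <eos>\n\n")   # one sentence per line
        f.write("@highlight\n\n")                   # CNN-DM separator before the summary
        f.write("<s> " + summary + " <eos>\n")      # reference summary

write_story("example.story",
            "QUERY", ["SENTENCE 1", "SENTENCE 2", "SENTENCE 3"], "SUMMARY")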

ahmed-moubtahij commented 3 years ago

I had modified the provided create_cnndm_format_stories_with_query.py to do the right formatting (here is the code) and here is what a cnndm-style story now looks like:

judiciary overload : would the legalization of marijuana alleviate strained courts ?
in america around # # people are arrested on marijuana related charges each year .
this creates a massive strain on court systems and on prisons .
the consequences of these strains are far-reaching including such problems as increased rates of plead bargaining .
@highlight
far too many people are imprisoned for the possession of cannabis

If I understand your example correctly, the <s> and <eos> tokens should be kept in the cnn-dm formatted story dataset? If so, what is the reason for that?

tahmedge commented 3 years ago

Hi, the <s> and <eos> tokens do not need to be kept in the cnndm-style story (I just used them in the example since they were given in the Debatepedia dataset). Even during the evaluation, we removed these tokens, as mentioned in the paper.

So you can exclude them. However, you must put a blank line between every two sentences, as follows:

judiciary overload : would the legalization of marijuana alleviate strained courts ?

in america around # # people are arrested on marijuana related charges each year .

this creates a massive strain on court systems and on prisons .

the consequences of these strains are far-reaching including such problems as increased rates of plead bargaining .

@highlight

far too many people are imprisoned for the possession of cannabis
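
Compared to your earlier stories, the only change is the blank line after each sentence. A minimal illustration (variable names are mine, not from our code): joining the sentences with a blank line rather than a plain newline is what lets the downstream preprocessing pick them up as separate sentences.

# Illustration only: build one story string with a blank line between sentences.
query = "judiciary overload : would the legalization of marijuana alleviate strained courts ?"
sentences = [
    "in america around # # people are arrested on marijuana related charges each year .",
    "this creates a massive strain on court systems and on prisons .",
]
summary = "far too many people are imprisoned for the possession of cannabis"

story = query + "\n\n" + "\n\n".join(sentences) + "\n\n" + "@highlight" + "\n\n" + summary + "\n"
print(story)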

ahmed-moubtahij commented 3 years ago

I've since used a blank line between sentences in the cnndm formatting, so I end up with stories like these:

fairness : is a university soccer playoffs fair ?

michael davis & tim kane .

`` would a college football playoff be fair ? ''

real clear politics .

november # # > `` an # -team playoff gives an undoubtedly weaker team the chance to defeating a team that was much better during the regular season .

that may make for entertaining entertainment but it is definitely unfair in its way . ''

@highlight

football playoffs risk weaker teams getting lucky

nootka sound convention : did britain renounce its claim to south american islands ?

great britain abandoned its settlement in # and formally renounced sovereignty in the nootka sound convention .

argentina has always claimed the falklands and never renounced its claim .

@highlight

argentina always claimed the falklands ; britain once renounced its claim .

They seem to fit the requirements (are the '#' characters supposed to stay in?), but these are the ROUGE scores I end up with: [screenshot: ROUGE results after blank-line preprocessing]. The precision scores are quite far off from the paper's. Do you have an idea of what else I could change or look into?

tahmedge commented 3 years ago

The probable reason for the lower precision score is that the generated summaries are much longer than required. What are the values of the following parameters when you generate the summaries of your test data with the evaluation script: (a) min_length and (b) max_length?

ahmed-moubtahij commented 3 years ago

When executing src/train.py with -mode validate: -min_length 20 -max_length 100

tahmedge commented 3 years ago

Please run the evaluation script again with -min_length 5 and -max_length 25. As far as I can remember, these were our evaluation script's parameters for Debatepedia. Let me know whether you can reproduce the result mentioned in the paper.
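
In other words, something along these lines, keeping the rest of your validation command unchanged (a sketch; only -mode, -min_length, and -max_length are taken from this thread):

python src/train.py -mode validate -min_length 5 -max_length 25  # other flags unchanged from your current command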

ahmed-moubtahij commented 3 years ago

That did it! I get better precision and thus F1 scores now, even slightly higher than the paper's: [screenshot: ROUGE results]. A result sample:

raw_src:

[CLS] artists : could they benefit from the ban ##s on download ##ing ? [SEP] [CLS] each and each download means losing revenues for the artist . [SEP] [CLS] worse still if people are can to download entire cds or films for free they have no incentives to buy the original version . [SEP][PAD]...[PAD]

gold_summary: downloaded equal losing revenues .

candidate_summary: downloading equal losing revenues .

Some results are incoherent w.r.t the query or grammatically incorrect, but so are some of the gold summaries in the Debatepedia dataset. Anyhow, thank you for your assistance!

tahmedge commented 3 years ago

Great to know that it works.