nlpyang / BertSum

Code for paper Fine-tune BERT for Extractive Summarization
Apache License 2.0

Problems with my own dataset and with the format_to_bert function. #97

Status: Closed (DarlineFiedler closed this issue 4 years ago)

DarlineFiedler commented 4 years ago

Maybe you can help me. I'm supposed to use my own data with BertSum for my final paper. The data are title and abstract pairs, and the goal is to generate a title from the abstract. At the moment I'm stuck on where to feed my dataset into the model. I also cannot open a .story file, so I don't know exactly how the original dataset is structured.

Maybe you can help me to customize BertSum.

DarlineFiedler commented 4 years ago

I've got a different problem now. It does not matter if I use the example file or my own. I always get the same message. When I use the -format_to_bert function I get the following error:

(base) D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src>python preprocess.py -mode format_to_bert -raw_path ../json_data -save_path ../bert_data -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
[('../json_data\cnndm_sample.train.0.json', Namespace(dataset='', log_file='../logs/preprocess.log', lower=True, map_path='../data/', max_nsents=100, max_src_ntokens=200, min_nsents=3, min_src_ntokens=5, mode='format_to_bert', n_cpus=4, oracle_mode='greedy', raw_path='../json_data', save_path='../bert_data', shard_size=2000), '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt')]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\Programme\anaconda3\lib\site-packages\multiprocess\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src\prepro\data_builder.py", line 273, in _format_to_bert
    torch.save(datasets, save_file)
  File "D:\Programme\anaconda3\lib\site-packages\torch\serialization.py", line 209, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "D:\Programme\anaconda3\lib\site-packages\torch\serialization.py", line 132, in _with_file_like
    f = open(f, mode)
FileNotFoundError: [Errno 2] No such file or directory: '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 63, in <module>
    eval('data_builder.'+args.mode + '(args)')
  File "<string>", line 1, in <module>
  File "D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src\prepro\data_builder.py", line 212, in format_to_bert
    for d in pool.imap(_format_to_bert, a_lst):
  File "D:\Programme\anaconda3\lib\site-packages\multiprocess\pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt'
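
Editor's note: the path in the error, '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt', points into a bert.pt_data folder that does not exist. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that the output name was derived from the whole Windows path, so the "json" in "json_data" was rewritten along with the file extension. The following is a minimal sketch of a more defensive way to build the output path and make sure the target folder exists before saving. The helper name bert_output_path is hypothetical and is not part of BertSum.

```python
import os
import re


def bert_output_path(json_path, save_dir):
    """Return the .bert.pt path for a JSON shard, creating save_dir if needed."""
    # Split on both separators so a Windows-style path such as
    # '../json_data\\cnndm_sample.train.0.json' yields only the file name.
    file_name = re.split(r"[\\/]", json_path)[-1]
    # Swap only the trailing extension, not every occurrence of 'json'.
    if file_name.endswith(".json"):
        stem = file_name[: -len(".json")]
    else:
        stem = file_name
    # Create the output folder up front so a later save cannot fail on it.
    os.makedirs(save_dir, exist_ok=True)
    return os.path.join(save_dir, stem + ".bert.pt")
```

With a helper like this, the sample shard would be written directly under the save folder as cnndm_sample.train.0.bert.pt instead of into a nested folder.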

DarlineFiedler commented 4 years ago

I solved the problem above. If I run the function now, I get two empty brackets. Is that right?

(base) D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src>python preprocess.py -mode format_to_bert -raw_path ../json_data -save_path ../bert_data -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
[('../json_data\cnndm_sample.train.0.json', Namespace(dataset='', log_file='../logs/preprocess.log', lower=True, map_path='../data/', max_nsents=100, max_src_ntokens=200, min_nsents=3, min_src_ntokens=5, mode='format_to_bert', n_cpus=4, oracle_mode='greedy', raw_path='../json_data', save_path='../bert_data', shard_size=2000), '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt')]
[]
[]

progsi commented 4 years ago

Empty brackets mean that some function received no input, so that is not correct. I remember having the same problem, but I don't remember how I solved it. I suggest running in debug mode by adding "-m pdb" after "python" in the command. Then you can print variables and check what is going wrong.

DarlineFiedler commented 4 years ago

Thank you. I tried "-m pdb" and got an AttributeError, but I don't know exactly what it tells me, or rather how to solve it.

The exact error is this:

--Return--
> D:\studium\bachelorarbeit\bachlorarbeit\bachelorarbeit\bert\bertsum\bertsum\src\preprocess.py(63)<module>()->None
-> eval('data_builder.'+args.mode + '(args)')
(Pdb) next
AttributeError: module '__main__' has no attribute '__spec__'
> <string>(1)<module>()->None

progsi commented 4 years ago

I think I remember now. I solved it by copying the following line into the preprocess module, next to the argparser arguments: __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
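
Editor's note: the AttributeError above appears when a script that uses multiprocessing is started under pdb, because the __main__ module can end up without a __spec__ attribute. As a hedged sketch of the same idea as the line above, the snippet below sets the attribute only when it is missing. Using None instead of the string is a substitution made here for illustration, and placing it near the top of preprocess.py, before any pool is created, is an assumption about where it would need to go.

```python
import sys

# Assumption: this runs near the top of the script that is started with
# "python -m pdb", before any multiprocessing pool is created.
main_module = sys.modules["__main__"]
if not hasattr(main_module, "__spec__"):
    # None is the value Python itself assigns to __main__.__spec__ when a
    # script is run directly, so it is used here as a neutral placeholder.
    main_module.__spec__ = None
```

This only removes the AttributeError seen under pdb. It does not by itself explain the empty brackets.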

DarlineFiedler commented 4 years ago

I added this line to preprocess.py, but maybe in the wrong way. Or aren't the argparser arguments in preprocess.py?

I still get the empty brackets, but when I run with -m pdb I no longer get an error. Maybe you can show me the exact spot in the code where the line should go.

tschomacker commented 4 years ago

Hi @DarlineFiedler, I am also currently writing my bachelor thesis on BertSum. For me the problem had something to do with the way my JSON files from step 4 were named. Maybe this comment helps you: https://github.com/nlpyang/BertSum/issues/90#issuecomment-588604300

DarlineFiedler commented 4 years ago

I also get these empty brackets when I try "cnndm_sample.train.0.json", not only with my own JSON data.

tschomacker commented 4 years ago

The empty brackets indicate that no files were found. Try using the absolute path and create a file for each category, e.g. cnndm_sample.valid.0.json, cnndm_sample.test.0.json, cnndm_sample.train.0.json. They can all be copies of cnndm_sample.train.0.json.
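
Editor's note: the check described above can be sketched as a small script that lists the files found per category and, for a quick test run only, copies the train file to any category that has none. The function names and the file name pattern are assumptions based on the names mentioned in this thread (for example cnndm_sample.train.0.json). They are not taken from the BertSum code.

```python
import glob
import os
import shutil

SPLITS = ("train", "valid", "test")


def files_per_split(raw_path):
    """Map each split name to the JSON shards found for it in raw_path."""
    # Assumption: shards are named like '<prefix>.<split>.<n>.json',
    # as in 'cnndm_sample.train.0.json' from this thread.
    return {
        split: sorted(glob.glob(os.path.join(raw_path, "*" + split + ".*.json")))
        for split in SPLITS
    }


def fill_missing_splits(raw_path):
    """Copy the first train shard to every split without files (testing only)."""
    found = files_per_split(raw_path)
    if not found["train"]:
        raise FileNotFoundError("no train shard found in " + raw_path)
    source = found["train"][0]
    for split in SPLITS:
        if not found[split]:
            # Rename only the split part of the file name, e.g. .train. -> .valid.
            name = os.path.basename(source).replace(".train.", "." + split + ".", 1)
            shutil.copy(source, os.path.join(raw_path, name))
    return files_per_split(raw_path)
```

An empty list for a category in the result of files_per_split corresponds to one of the empty brackets in the output shown earlier. Copied files are only useful to confirm that the pipeline runs. Real validation and test data are needed for meaningful results.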

tschomacker commented 4 years ago

I have also encountered a few problems with the original BertSum. Because of this I switched to a fork of it: https://github.com/Santosh-Gupta/BertSum and based on this I have created my own fork: https://github.com/tschomacker/BertSum

DarlineFiedler commented 4 years ago

Thanks, that really helped me a lot. I actually just forgot to create the valid and test data.

namln2k commented 2 years ago

Hello @DarlineFiedler and @tschomacker, I'm working on this repo and need help processing my own dataset to test the model. I followed the guide and completed the training with the preprocessed dataset. However, when working on my own dataset, I'm stuck at this step:

Step 4. Format to Simpler Json Files
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower

No output was printed. I think it's because of the /urls folder. I don't know what it means, so can you help me?

tschomacker commented 2 years ago

As I have indicated previously: There is a fork https://github.com/Santosh-Gupta/BertSum and based on this I have created my own fork: https://github.com/tschomacker/BertSum . Both fixed this problem. As a starting point look at: https://github.com/tschomacker/BertSum/blob/master/src/prepro/data_builder.py#L246 . I hope this helps :)

namln-hust commented 2 years ago

@tschomacker I've figured out the problem myself. Thanks very much anyway!

hannanyi commented 2 years ago

@tschomacker Hi, is there a pre-trained BertSum model in your fork? If so, could you please send me a copy? It would be very useful to me. hannan@stumail.nwu.edu.cn