huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CSV/JSON file format for examples/token-classification/run_ner.py #8698

Closed ganeshjawahar closed 3 years ago

ganeshjawahar commented 3 years ago

Environment info

Who can help

@mfuntowicz, @stefan-it

Information

Model I am using (Bert, XLNet ...): XLM-R

The problem arises when using:

The task I am working on is:

https://github.com/huggingface/transformers/tree/master/examples/token-classification

python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval

I am trying to perform NER on a custom dataset. It's not clear what the format of path_to_train_file and path_to_validation_file should be. From the code, it seems the files should be CSV or JSON. Could you please give more details on this so that I can format my dataset accordingly?

Thanks.

stefan-it commented 3 years ago

Hi @ganeshjawahar , please have a look at the run_ner_old.py script! It should handle custom files 🤗

stefan-it commented 3 years ago

Usage and more examples are documented here:

https://github.com/huggingface/transformers/tree/master/examples/token-classification#old-version-of-the-script

ganeshjawahar commented 3 years ago

Thanks for the quick response. I'm able to use run_ner_old.py with my custom dataset. Is there similar documentation for using run_ner.py with a custom dataset?

P.S.: run_ner_old.py loads all examples into RAM, and that's a problem for me because my custom dataset is very large. I was thinking of getting around this by using run_ner.py, which uses the datasets library.

ganeshjawahar commented 3 years ago

If you can provide a tiny example for csv or json format, that should be very helpful. 🤗

stefan-it commented 3 years ago

Ah, I see. An example of a JSON-based file format can be found here:

https://github.com/huggingface/transformers/blob/master/tests/fixtures/tests_samples/conll/sample.json

Another possibility would be to write a custom recipe with the Hugging Face datasets library. Then you can run the run_ner.py script by passing the (local) path of your recipe to the script. Just have a look at the CoNLL-2003 dataset/recipe:

https://github.com/huggingface/datasets/blob/master/datasets/conll2003/conll2003.py

You could use it as a template and modify it for your needs 🤗
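As a rough, unofficial sketch (not the actual conll2003.py recipe), the heart of such a loading script is a generator that walks a CoNLL-style file, collecting one token and one tag per line and yielding a sentence at each blank line. A minimal, self-contained version of that parsing step, assuming each non-blank line has at least two whitespace-separated columns (token first, tag second):

```python
def read_conll(lines):
    """Parse CoNLL-style lines ("TOKEN TAG" per line, sentences
    separated by blank lines) and yield (tokens, tags) pairs."""
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
        else:
            # Assumes at least two columns; extra columns are ignored.
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:
        # Flush the last sentence if the file lacks a trailing blank line.
        yield tokens, tags
```

In an actual datasets recipe this logic would live inside the builder's `_generate_examples` method; the sketch above only shows the file-walking part.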

gpiat commented 3 years ago

I think the JSON sample should be linked in the token-classification README for people trying to use run_ner.py with local files. Would you also be willing to provide a CSV sample? So far, everything I have figured out has come through trial, error, and code deciphering.

Right now, my CSV file looks like this:

token,label
DC,M
##T,M
##N,M
##4,M
as,O
a,O
m,O
##od,O
##ifier,O
...

I get the following error:

File "projects/github/transformers/examples/token-classification/run_ner.py", line 221, in main
    if isinstance(features[label_column_name].feature, ClassLabel):
AttributeError: 'Value' object has no attribute 'feature'

Using the Python debugger, I've found that features[label_column_name] is Value(dtype='string', id=None), but I don't know if this is expected behavior. I can only assume it isn't, but I can't figure out what else features[label_column_name] could or should be.

I'm pretty much stuck, and knowing if the issue comes from the structure of my CSV would be very helpful.

Furthermore, I've tried formatting my data as close as I could to the JSON conll sample, but I get the following error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1

After a bit of googling, it turns out that, as I suspected, a single JSON file cannot contain multiple top-level JSON objects. So if the intended JSON format for run_ner.py requires one JSON object per sequence, but a JSON file can't contain more than one object, how can we get run_ner.py to work with several sequences in JSON mode?

millanbatra1234 commented 3 years ago

Exact same process/issues/errors as @gpiat. It would be very helpful if the format for the CSV option of run_ner.py were explicitly defined in the README. A fully functional sample input for the CSV option would make it much simpler to reshape our custom data to match the sample than to write a custom recipe.

AleksandrsBerdicevskis commented 3 years ago

Same problem as @gpiat with CSV. Also, @stefan-it, it seems the old script is no longer available?

jeremybmerrill commented 3 years ago

I believe I've solved the same problem as @gpiat , @millanbatra1234 and @AleksandrsBerdicevskis have had:

Replace the if isinstance(features[label_column_name].feature, ClassLabel): in run_ner.py with if hasattr(features[label_column_name], 'feature') and isinstance(features[label_column_name].feature, ClassLabel):.

I tried @gpiat's CSV format and that doesn't work. Instead, I used the JSON format, which looks like this:

{"tokens": ["APPLICATION", "and", "Affidavit", "for", "Search", "Warrant", "as", "to", "The", "Matter", "of", "the", "Search", "of", "9", "Granite", "Street", ",", "#", "5", "(", "Attachments", ":", "#", "1", "Affidavit", "of", "James", "Keczkemethy)(Belpedio", ",", "Lisa", ")", "(", "Entered", ":", "12/15/2020", ")"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"tokens": ["APPLICATION", "for", "Search", "Warrant", "by", "USA", "as", "to", "702", "-", "517", "-", "7282", "(", "KM", ",", "ilcd", ")", "(", "Entered", ":", "12/10/2020", ")"], "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"tokens": ["APPLICATION", "AND", "AFFIDAVIT", "by", "USA", "as", "to", "4", "CELLULAR", "TELEPHONES", "SEIZED", "FROM", "THE", "FDC", "IN", "PHILADELPHIA", "AND", "CURRENTLY", "HELD", "BY", "THE", "FBI", "PHILADELPHIA", "DIVISION", "Re", ":", "Search", "Warrant", "Issued", ".", "(", "mac", ",", ")", "(", "Entered", ":", "12/09/2020", ")"], "tags": ["O", "O", "O", "O", "O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}

So, yes, you can have more than one JSON object in the file. Each JSON object goes on its own line. This is sometimes called JSONL or JSONLINES format.
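In other words, the expected input is JSON Lines: the file as a whole is not valid JSON, but each individual line is a complete JSON object. A small sketch of writing and reading such data with the standard library (the tokens/tags field names follow the sample above; the example sentences are made up):

```python
import io
import json

examples = [
    {"tokens": ["EU", "rejects", "German", "call"],
     "tags": ["B-ORG", "O", "B-MISC", "O"]},
    {"tokens": ["Peter", "Blackburn"],
     "tags": ["B-PER", "I-PER"]},
]

# Write: one json.dumps() per line -- not json.dump() of the whole list,
# which would produce a single JSON array instead of JSON Lines.
buf = io.StringIO()
for ex in examples:
    buf.write(json.dumps(ex) + "\n")

# Read back: parse each non-empty line independently.
buf.seek(0)
parsed = [json.loads(line) for line in buf if line.strip()]
```

The same write/read pattern works with a real file opened in text mode instead of the in-memory buffer used here.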

AleksandrsBerdicevskis commented 3 years ago

@jeremybmerrill Thanks! Yes, JSON does work, I should have mentioned that (it actually works even without the code change you suggest).

(With JSON, I run into another input problem (#9660), but I guess that's a different story.)

denis-gordeev commented 3 years ago

In my case the JSON format didn't work because of https://github.com/huggingface/datasets/issues/2181: pyarrow can't handle JSON if a line is too big, so I had to split large lines into smaller ones.

jzhw0130 commented 2 years ago

JSON works, but CSV still does not.

savasy commented 2 years ago

CSV file input does not work! I converted it into JSON, and it works now.
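For anyone else converting rather than fighting the CSV path, here is a hedged sketch of turning a token,label CSV (like the one earlier in this thread) into the JSON Lines layout that run_ner.py accepts. It assumes a header row and that a row with an empty token marks a sentence boundary; if your data marks boundaries differently, adjust the check accordingly:

```python
import csv
import io
import json

def csv_to_jsonl(csv_text):
    """Convert "token,label" CSV rows into JSON Lines, one sentence
    per output line. A row with an empty token ends a sentence."""
    out = []
    tokens, tags = [], []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the "token,label" header row
    for row in reader:
        if not row or not row[0]:
            # Sentence boundary: emit the collected tokens, if any.
            if tokens:
                out.append(json.dumps({"tokens": tokens, "tags": tags}))
                tokens, tags = [], []
        else:
            tokens.append(row[0])
            tags.append(row[1])
    if tokens:
        out.append(json.dumps({"tokens": tokens, "tags": tags}))
    return "\n".join(out)
```

Each returned line can then go straight into the file passed as --train_file or --validation_file.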