tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0

SpanMarker library for document level context Gives Error. (RuntimeError: CUDA error: device-side assert triggered) #45

Open rudyrdx opened 11 months ago

rudyrdx commented 11 months ago

Gives this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

FineTune Code:

from datasets import load_dataset, Dataset
dataset = load_dataset("json", data_files=["output.jsonl"])
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",  # Example encoder
    labels=['O','Degree','Years_of_Experience','Email_Address'
        'College_Name','Location','Designation','Graduation_Year','Skills','Name'
        'Companies_worked_at'],
    max_prev_context=2,
    max_next_context=2,
)
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="models/RUDYRDX-NER-1",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)
from span_marker import Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
)
trainer.train() # error happens when this runs
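
A side note on the label list in the code above: there is no comma after 'Email_Address' or after 'Name', and Python silently joins adjacent string literals. The short check below copies that list as written and shows what Python builds from it. Whether this contributes to the CUDA error is not established in this thread; it is only a sketch for checking the list.

```python
# The label list exactly as written in the fine-tuning code above.
labels = ['O','Degree','Years_of_Experience','Email_Address'
    'College_Name','Location','Designation','Graduation_Year','Skills','Name'
    'Companies_worked_at']

# Adjacent string literals are concatenated, so two pairs of labels are merged.
print(len(labels))
for label in labels:
    print(repr(label))
```

The list ends up with 9 entries instead of the intended 11, including 'Email_AddressCollege_Name' and 'NameCompanies_worked_at'. Adding the two missing commas restores the intended labels.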

Dataset Sample:

{"document_id": 0, "sentence_id": 0, "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru", "Karnataka", "Karnataka", "-", "Email", "Indeed", ":", "indeed.com/r/Govardhana-K/", "b2de315d95905b68", "Total", "experience", "5", "Years", "6", "Months", "Cloud", "Lending", "Solutions", "INC", "4", "Month", "Salesforce", "Developer", "Oracle", "5", "Years", "2", "Month", "Core", "Java", "Developer", "Languages", "Core", "Java", "Go", "Lang", "Oracle", "PL-SQL", "programming", "Sales", "Force", "Developer", "APEX", "."], "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"document_id": 0, "sentence_id": 1, "tokens": ["Designations", "&", "Promotions", "Willing", "relocate", ":", "Anywhere", "WORK", "EXPERIENCE", "Senior", "Software", "Engineer", "Cloud", "Lending", "Solutions", "-", "Bangalore", "Karnataka", "-", "January", "2018", "Present", "Present", "Senior", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2016", "December", "2017", "Staff", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "January", "2014", "October", "2016", "Associate", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2012", "December", "2013", "EDUCATION", "B.E", "Computer", "Science", "Engineering", "Adithya", "Institute", "Technology", "-", "Tamil", "Nadu", "September", "2008", "June", "2012", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "SKILLS", "APEX", "."], "ner_tags": ["Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation"]}
{"document_id": 0, "sentence_id": 2, "tokens": ["(", "Less", "1", "year", ")", "Data", "Structures", "(", "3", "years", ")", "FLEXCUBE", "(", "5", "years", ")", "Oracle", "(", "5", "years", ")", "Algorithms", "(", "3", "years", ")", "LINKS", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/", "ADDITIONAL", "INFORMATION", "Technical", "Proficiency", ":", "Languages", ":", "Core", "Java", "Go", "Lang", "Data", "Structures", "&", "Algorithms", "Oracle", "PL-SQL", "programming", "Sales", "Force", "APEX", "."], "ner_tags": ["Name", "Name", "Name", "Designation", "Designation", "Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O"]}
{"document_id": 0, "sentence_id": 3, "tokens": ["Tools", ":", "RADTool", "Jdeveloper", "NetBeans", "Eclipse", "SQL", "developer", "PL/SQL", "Developer", "WinSCP", "Putty", "Web", "Technologies", ":", "JavaScript", "XML", "HTML", "Webservice", "Operating", "Systems", ":", "Linux", "Windows", "Version", "control", "system", "SVN", "&", "Git-Hub", "Databases", ":", "Oracle", "Middleware", ":", "Web", "logic", "OC4J", "Product", "FLEXCUBE", ":", "Oracle", "FLEXCUBE", "Versions", "10.x", "11.x", "12.x", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/"], "ner_tags": ["Name", "Name", "Designation", "Designation", "Designation", "Location", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O"]}

jackboyla commented 11 months ago

I believe the error comes up because the ner_tags actually need to be ints. The error you see usually appears when PyTorch encounters an indexing mismatch, such as an index outside the range a layer expects.

I had some trouble with this myself and found that mapping the string ner_tags to integer IDs fixed the issue.

When you instantiate a SpanMarker model, the config already creates this map for you, using the list of labels you provide. You can see it by calling model.config.__getattribute__("encoder")["label2id"].
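
A minimal sketch of the mapping described above, in plain Python. The helper name encode_tags is hypothetical, and the label spellings (with spaces) are an assumption chosen to match the tags in the dataset sample, which differ from the underscore spellings passed to the model. The map is built by hand here for illustration rather than read from the model config.

```python
# Assumed label spellings, matching the tag strings in the dataset sample.
labels = [
    "O", "Degree", "Years of Experience", "Email Address", "College Name",
    "Location", "Designation", "Graduation Year", "Skills", "Name",
    "Companies worked at",
]
label2id = {label: idx for idx, label in enumerate(labels)}


def encode_tags(example):
    """Hypothetical helper: replace string tags with integer IDs."""
    unknown = sorted(set(example["ner_tags"]) - label2id.keys())
    if unknown:
        # Fail early with a readable message instead of a CUDA assert later.
        raise ValueError(f"Tags missing from the label list: {unknown}")
    return {"ner_tags": [label2id[tag] for tag in example["ner_tags"]]}


example = {
    "tokens": ["Govardhana", "K", "Senior"],
    "ner_tags": ["Name", "Designation", "O"],
}
print(encode_tags(example)["ner_tags"])  # [9, 6, 0]
```

A function like this could be applied per row, for example with Dataset.map from the datasets library, though it is worth checking the resulting column type afterwards. Any tag that does not appear in the label list raises a ValueError on the CPU side, which is easier to read than a device-side assert.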

@tomaarsen I think this should be explicitly mentioned somewhere in the repo since the errors don't make it clear what's gone wrong when integers aren't provided.

rudyrdx commented 11 months ago

@jackboyla Thanks for letting me know, I will try it.

rudyrdx commented 11 months ago

Hi, I went through my training data again and noticed that the spans were wrong. When I split the data by word length and then tried to generate NER tags for the resulting sentences, the spans were not correct: the start and end NER tags were wrong for the sentences. Apparently I lack the brain power to work this out, so I switched to spaCy and was able to get NER working there (not token classification, but sentence (paragraph) classification). Now I want to try this again with SpanMarker, so I will update after checking whether the problem was the numeric IDs or something else.
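
One way to spot misaligned spans like the ones described above is to print each tagged token next to its tag. The sketch below uses the first six tokens and tags of the first dataset sample row; the helper name labelled_tokens is hypothetical.

```python
def labelled_tokens(example, outside_tag="O"):
    """Hypothetical helper: return (token, tag) pairs for tagged tokens."""
    tokens, tags = example["tokens"], example["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != outside_tag]


# First six tokens and tags from the first row of the dataset sample.
example = {
    "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru"],
    "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O"],
}

for token, tag in labelled_tokens(example):
    print(f"{tag:<12} {token}")
```

In this excerpt "K" is tagged Designation and "Engineer" is tagged O, which suggests the tags are shifted by one token relative to the text.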