abc3436645 opened 2 months ago
I've added support for non-English chars, a test case for this, and some minor improvements to prevent panics. Let me know if you run into any other issues.
I've updated to version 0.1.2. The non-English issue has been resolved, but the freeze while loading data still occurs, and no error is reported. You can reproduce this by running it on a larger dataset.
My script is below:
import sys
import json
#from abbreviations import schwartz_hearst
from abbreviation_extractor import extract_abbreviation_definition_pairs

if __name__ == "__main__":
    try:
        for line in sys.stdin:
            #abbrec_res = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=line)
            abbrec_res = extract_abbreviation_definition_pairs(line, most_common_definition=True, tokenize=True)
            if abbrec_res:
                for pair in abbrec_res:
                    print(json.dumps({str(pair.abbreviation): str(pair.definition)}, ensure_ascii=False) + "\n")
    except:
        raise
I've improved the error handling and made several changes to some methods; these are documented in the updated README.
I noticed the code you provided reads a text file and calls extract_abbreviation_definition_pairs for each line. I made an optimized function extract_abbreviations_from_file which takes in a path to a file. With it, most_common_definition and first_definition work across all definitions found in the file, and failed extractions are collected in results.errors.

Here is an example usage based on your code:
import json
from abbreviation_extractor import extract_abbreviations_from_file

input_file = "pubmed_abstracts_20240809.txt"
result = extract_abbreviations_from_file(input_file, most_common_definition=True, tokenize=True, show_progress=True)
print(f"Found {len(result.extractions)} abbreviations. Number of failed extractions: {len(result.errors)}")

json_res = [{pair.abbreviation: pair.definition} for pair in result.extractions]
with open(input_file.replace(".txt", "_abbreviations.json"), "w") as f:
    json.dump(json_res, f, ensure_ascii=False, indent=4)
I ran it on a larger dataset of about 100 MB of abstracts; you can find it in benches/pubmed_abstracts_20240801_to_20240809.txt. I did not encounter any freezing, though that may be because I do not have the exact data you are working with.

Let me know how it goes this time.
When the input contains non-English text, the script now prints an error message and keeps running without crashing. However, when I run it on a large dataset, it still cannot process the entire dataset:
thread '' panicked at src/extraction.rs:466:25:
byte index 62 is not a char boundary; it is inside 'ö' (bytes 61..63) of
"Two kinds of mechanical valve, St. Jude Medical (SJM) and Björk-Shiley (B-S), in patients with single valve replacement have been evaluated on a view point of intravascular hemolysis.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
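For context, this is Rust's standard panic when a string is sliced at a byte offset that falls inside a multi-byte UTF-8 character (here 'ö', which occupies two bytes). A minimal Python sketch of the same boundary problem, using only the standard library:

```python
s = "Björk"
b = s.encode("utf-8")

# 'ö' is encoded as two bytes, so the byte length exceeds the character count.
assert len(s) == 5
assert len(b) == 6

# Cutting the byte string at index 3 lands inside the two-byte encoding
# of 'ö' (bytes 2..4) and the result cannot be decoded, analogous to the
# "byte index ... is not a char boundary" panic above.
try:
    b[:3].decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid cut:", e)
```

Any slicing inside the extractor therefore has to happen on character boundaries rather than raw byte offsets when the text contains non-ASCII characters.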