Script returns an error on inputs that include non-English text

abc3436645 commented 2 months ago

When the input contains non-English text, the script will output an error message, but it will continue running without crashing. Nevertheless, when I execute this script on a large dataset, it is unable to process the entire dataset.

thread '' panicked at src/extraction.rs:466:25:
byte index 62 is not a char boundary; it is inside 'ö' (bytes 61..63) of "Two kinds of mechanical valve, St. Jude Medical (SJM) and Björk-Shiley (B-S), in patients with single valve replacement have been evaluated on a view point of intravascular hemolysis.
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
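For context on this class of panic: Rust strings are UTF-8, and slicing one at a byte offset that falls inside a multi-byte character panics. The sketch below reproduces the same condition in Python on a shortened, illustrative fragment of the abstract. `is_char_boundary` is a hypothetical helper written for this example, based on the UTF-8 rule that continuation bytes look like 10xxxxxx; it is not part of abbreviation_extractor, and this is not the library's actual code.

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """Return True if `index` falls between two UTF-8 encoded characters."""
    if index == 0 or index == len(data):
        return True
    if index < 0 or index > len(data):
        return False
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (data[index] & 0xC0) != 0x80


text = "Björk-Shiley"           # illustrative fragment, not the full abstract
encoded = text.encode("utf-8")  # 'ö' becomes the two bytes 0xC3 0xB6

# Character offsets and byte offsets diverge after the first non-ASCII character.
char_len = len(text)
byte_len = len(encoded)

# Byte offset 3 lands between the two bytes of 'ö', so it is not a valid cut point.
try:
    encoded[:3].decode("utf-8")
    cut_inside_char_failed = False
except UnicodeDecodeError:
    cut_inside_char_failed = True
```

Any code that computes offsets in characters and then applies them as byte indices (or the reverse) can hit this as soon as the text contains a character outside ASCII, which would explain why English-only input never triggered it.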

praise2112 commented 2 months ago

I've added support for non-English chars, a test case for this, and some minor improvements to prevent panics. Let me know if you run into any other issues.

abc3436645 commented 2 months ago

I've updated to version 0.1.2, and the non-English issue has been resolved, but the problem of freezing while loading data still exists, and no error is reported. You can try running it with a larger dataset to reproduce this issue.

My script is below:

import json
import sys

# from abbreviations import schwartz_hearst
from abbreviation_extractor import extract_abbreviation_definition_pairs

if __name__ == "__main__":
    # Treat each line of stdin as one document.
    for line in sys.stdin:
        # abbrev_res = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=line)
        abbrev_res = extract_abbreviation_definition_pairs(
            line, most_common_definition=True, tokenize=True
        )
        if abbrev_res:
            for pair in abbrev_res:
                # One JSON object per line; print() already appends the newline.
                print(json.dumps({str(pair.abbreviation): str(pair.definition)}, ensure_ascii=False))

praise2112 commented 2 months ago

I've improved the error handling and changed several methods; the changes are documented in the README.

I noticed the code you provided reads a text file and calls extract_abbreviation_definition_pairs for each line.

I added an optimized function, extract_abbreviations_from_file, which takes a path to a file. It:

- Reads the file through a buffer, so large files do not cause memory issues.
- Reads in chunks (1 MB, which can be changed) rather than line by line, so tokenization is put to use and definitions split across a newline are not missed.
- Uses multi-threading to process multiple chunks in parallel.
- Applies most_common_definition and first_definition across all definitions found in the file.
- Handles chunks that error out, so one failure does not break the whole run; you can still inspect the errors in result.errors.
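The general pattern described in the list above can be sketched in Python as follows. This is only an illustration of chunked, parallel processing with per-chunk error capture, not the library's implementation (which is written in Rust). `iter_chunks`, `process_chunks`, and the `worker` callback are hypothetical names, and extending each chunk to the end of the current line is an assumption about how chunk boundaries might be chosen.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield pieces of about `chunk_size` characters, each ending at a line break or EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        # Read on to the end of a line so a chunk never stops mid-line.
        chunk += stream.readline()
        yield chunk


def process_chunks(stream, worker, chunk_size=1024 * 1024, max_workers=4):
    """Run `worker(chunk)` on every chunk; return (results, errors)."""
    results, errors = [], []
    pending = deque()  # (chunk index, future), oldest first

    def collect(index, future):
        try:
            results.extend(future.result())
        except Exception as exc:
            # A failing chunk is recorded and the remaining chunks still run.
            errors.append((index, str(exc)))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index, chunk in enumerate(iter_chunks(stream, chunk_size)):
            pending.append((index, pool.submit(worker, chunk)))
            # Keep only a bounded number of chunks in flight to limit memory use.
            if len(pending) >= max_workers * 2:
                collect(*pending.popleft())
        while pending:
            collect(*pending.popleft())
    return results, errors
```

With a text file opened via `open(path, encoding="utf-8")` as `stream`, the `worker` would be whatever function extracts pairs from a block of text. Note that threads in CPython do not speed up pure-Python CPU-bound work; the sketch is meant to show the control flow, not the performance of the real function.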

Here is an example usage based on your code:

import json
from abbreviation_extractor import extract_abbreviations_from_file

input_file = "pubmed_abstracts_20240809.txt"

result = extract_abbreviations_from_file(input_file, most_common_definition=True, tokenize=True, show_progress=True)
print(f"Found {len(result.extractions)} abbreviations. Number of failed extractions: {len(result.errors)}")

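# One {abbreviation: definition} object per extracted pair.
json_res = [{pair.abbreviation: pair.definition} for pair in result.extractions]

# Write as UTF-8 explicitly, since ensure_ascii=False keeps non-ASCII characters unescaped.
with open(input_file.replace(".txt", "_abbreviations.json"), "w", encoding="utf-8") as f:
    json.dump(json_res, f, ensure_ascii=False, indent=4)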

I ran it on a larger dataset of about 100 MB of abstracts, which you can find in benches/pubmed_abstracts_20240801_to_20240809.txt. I did not encounter any freezing, probably because I do not have the data you are working with. Let me know how it goes this time.

praise2112 / abbreviation-extractor, issue #1