tshatrov / ichiran

Linguistic tools for texts in Japanese language
MIT License

Slow parsing speed #17

Closed sethzh closed 2 years ago

sethzh commented 2 years ago

This may be impossible to do anything about, but is there a way to increase the speed of the parser?

This is by far the most accurate Japanese parser I've seen, and I'm trying to use it to create a frequency list. I have tens of millions of sentences I want to analyze, but it took three days just to analyze 1 million. By comparison, mecab only took half an hour.

I am currently running the cli on each line, getting the JSON output, and working with that in NodeJS.

Let me know if there's anything I can do to speed up the process.

wolfv506 commented 2 years ago

Hi,

I ran into the same issue as you. I don't know if it helps, but I found an ad hoc solution in Python. I managed to cut the parse time significantly by using a thread pool (concurrent.futures) with 8-10 threads (see attachment for benchmarks). I also found that processing somewhat larger chunks of text at a time is more time-efficient than passing single lines. As a rule of thumb, I would read in the text of a whole book/novel, then split it into evenly sized chunks based on the number of threads I used.

Compared to mecab it's still significantly slower, but I guess that's just the trade-off between accuracy and time efficiency.

(benchmark chart attached)

tshatrov commented 2 years ago

Well, this segmenter was designed for interactive usage, so it's not particularly efficient (I did optimize some things, but only until it was 'good enough'). The main algorithm is quadratic with respect to the length of the sentence; however, that only applies to completely unsegmented text (with no punctuation, spaces, or non-Japanese characters at all).

There's probably some overhead from calling the cli on every sentence separately instead of bunching sentences together, separated by some sort of punctuation.

Also, your approach seems to be single-threaded. Ichiran is thread-safe, so the fastest approach would be native multithreading in Lisp, but launching the cli in parallel should also work (although it's quite wasteful in terms of memory, as each cli process is quite chunky).

sethzh commented 2 years ago

I should have mentioned that I already have the program spawning ichiran-cli concurrently with 8 workers.

I don't know much about concurrent or multi-threaded programming, so I'm not sure what the difference is.

Also, I tried processing larger chunks rather than single sentences, and the performance was a little slower than with single sentences.

You have to realize my program is held together by duct tape.

If I set this up on another machine I have with similar performance and run them at the same time, it should take me half the time to get the results I need.

I have about 18,000,000 sentences I still want to cover, so that will take about 27 days with this method.

I also have to consider that since this is a frequency list, there are eventual diminishing returns. 10,000,000 instead of 1,000,000 will give me much more accurate frequency rankings, but 18,000,000 instead of 10,000,000 won't be nearly as useful. Limiting myself to 10,000,000 sentences would take about 2 weeks.

This is probably my best course of action other than upgrading my PC or figuring out how to do this in Lisp.

Thanks for the quick responses.

krackers commented 1 year ago

There's no fundamental reason this has to be slow. The DAG search should be linear time; the main cost is probably the dictionary lookups for the N^2 subsegments. For those you'd want fancier data structures such as a trie or FST to reduce the runtime. This part is shared with other lattice-based segmenters like MeCab, so you could borrow their implementation directly.
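
To illustrate the idea (a minimal sketch of my own, not ichiran's or MeCab's actual code): with a character-keyed trie, one walk from a given start position returns every dictionary key that begins there, so the per-position cost is bounded by the length of the longest key instead of one separate lookup per candidate substring.

(defstruct trie-node
  (children (make-hash-table)) ; char -> trie-node
  (entry nil))                 ; dictionary entry, if a key ends here

(defun trie-insert (root key entry)
  "Insert KEY (a string) with its dictionary ENTRY into the trie at ROOT."
  (let ((node root))
    (loop for ch across key
          do (setf node (or (gethash ch (trie-node-children node))
                            (setf (gethash ch (trie-node-children node))
                                  (make-trie-node)))))
    (setf (trie-node-entry node) entry)))

(defun common-prefix-matches (root text start)
  "Return (end-position . entry) for every key that matches TEXT from START."
  (loop with node = root
        for i from start below (length text)
        for child = (gethash (char text i) (trie-node-children node))
        while child
        do (setf node child)
        when (trie-node-entry node)
          collect (cons (1+ i) (trie-node-entry node))))

;; (defvar *dict* (make-trie-node))
;; (trie-insert *dict* "犬" :inu)
;; (trie-insert *dict* "犬夜叉" :inuyasha)
;; (common-prefix-matches *dict* "犬夜叉っ" 0) ; => ((1 . :INU) (3 . :INUYASHA))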

Dillpickleschmidt commented 1 month ago

Ichiran Parallelization in Lisp: Processing Large Datasets Efficiently

I've successfully implemented ichiran parallelization in Lisp for processing large amounts of data. This solution achieves 100% utilization across all cores, significantly speeding up text processing tasks.

(CPU utilization graph attached; this is the average across all cores.)

Please keep in mind that I touched Lisp for the very first time yesterday morning, so most of this is AI-generated code. Still, I had to spend two days debugging it, so it will likely be a useful starting point for anyone wanting to use ichiran at a larger scale.

Input Format

This code assumes a CSV input format like this:

11,これさえあれば 俺は
11,本物の妖怪になれるんだ ヘヘッ
12,(桔梗(ききょう))犬夜叉っ
13,おわっ

The numbers are incrementing IDs for blocks of text from .srt transcripts, grouped by timestamp. However, you can easily use Claude or ChatGPT to modify the Lisp code to suit different inputs.

(original InuYasha.S01E001.WEBRip.Netflix.ja[cc].srt file that I transformed to csv using Rust)

11
00:01:56,616 --> 00:02:00,829
これさえあれば 俺は
本物の妖怪になれるんだ ヘヘッ

12
00:02:04,332 --> 00:02:05,625
(桔梗(ききょう))犬夜叉っ

13
00:02:10,463 --> 00:02:11,506
おわっ

Output

This code segments words into text and kana, separated by a comma ",", with each word placed on a new line. You can also modify this output with Claude or ChatGPT, using different ichiran commands to get different info. Refer to the ichiran installation guide for common commands. Also, check out the package.lisp file in the ichiran directory to see all the functions it exports.

12,(,(
12,桔梗,ききょう
12,(,(
12,ききょう,(ききょう)
12,)),))
12,犬夜叉,いぬやしゃ
12,っ,っ
13,お,お
13,わ,わ
13,っ,っ

*Sometimes ichiran puts kana in parentheses. Sometimes more than one kana representation is listed inside the parentheses, separated by Japanese space characters " ".
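
If you need the individual readings, a tiny helper along these lines can split that field (hypothetical, my own addition, not part of the code below; it assumes the alternatives are separated by the full-width space U+3000):

(defun split-kana-alternatives (kana)
  "Hypothetical helper: strip surrounding parentheses and split alternative
readings on the full-width space (U+3000)."
  (uiop:split-string (string-trim "()（）" kana)
                     :separator (string (code-char #x3000))))

;; (split-kana-alternatives "(ききょう　キキョウ)") ; => ("ききょう" "キキョウ")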

Setup and Usage Instructions

The goal is to create a new Lisp project that imports/uses the ichiran library functions for parsing.

  1. Follow the ichiran installation instructions.

  2. Create a new directory for this Lisp project placed next to the ichiran directory in quicklisp/local-projects.

  3. Create an ichiran-file-processor.lisp file and paste in the code described in the next section.

  4. In the terminal, cd to the directory you created (folder containing ichiran-file-processor.lisp).

  5. Run SBCL with increased dynamic space size:

    sbcl --dynamic-space-size 32768

    Important: Make sure to add the dynamic-space-size flag! Adjust the size based on your available memory (in MB). I have 64GB memory so half that just seemed safe, though this program doesn't ever seem to exceed 2GB of RAM usage due to streaming. Adding this flag is crucial to avoid random fatal memory errors (I spent a lot of time banging my head against a wall until I discovered that adding this flag fixed the crashes).

  6. Load the file:

    (load "ichiran-file-processor.lisp")
  7. Run the main function:

    (ichiran-file-processor:main "transcripts.csv")

    Replace "transcripts.csv" with your input file name (placed in the same directory as the ichiran-file-processor.lisp file).

Lisp Code

Here's the Lisp code that makes this possible:

(ql:quickload '(:ichiran :lparallel :cl-csv :bordeaux-threads))

(defpackage :ichiran-file-processor
  (:use :cl :ichiran :lparallel)
  (:export :main))

(in-package :ichiran-file-processor)

(defparameter *chunk-size* 10000 "Number of lines to process in each chunk")
(defparameter *progress-interval* 100 "Number of lines between progress updates")

(setf lparallel:*kernel* (lparallel:make-kernel 8)) ; Use 8 cores - CHANGE THIS FOR YOUR SYSTEM

(defvar *line-counter* 0)
(defvar *total-lines* 0)
(defvar *line-counter-lock* (bt:make-lock))

(defun split-csv-line (line)
  "Split a CSV line into number and text parts"
  (let ((comma-pos (position #\, line)))
    (if comma-pos
        (cons (parse-integer (subseq line 0 comma-pos) :junk-allowed t)
              (string-trim '(#\Space #\Tab) (subseq line (1+ comma-pos))))
        (cons nil line))))

(defun process-sentence (sentence line-number)
  "Process a single Japanese sentence using Ichiran and format as CSV rows"
  (handler-case
    (mapcar (lambda (w)
              (list (write-to-string line-number)
                    (ichiran/dict:word-info-text w)
                    (ichiran/dict:word-info-kana w)))
            (ichiran/dict:simple-segment sentence))
    (error (e)
      (list (list (write-to-string line-number)
                  sentence
                  (format nil "ERROR: ~A" e))))))

(defun update-progress ()
  "Update and print progress if necessary"
  (bt:with-lock-held (*line-counter-lock*)
    (incf *line-counter*)
    (when (zerop (mod *line-counter* *progress-interval*))
      (let ((percentage (* 100 (/ *line-counter* *total-lines*))))
        (format t "Processed ~D/~D (~,3F%)~%" *line-counter* *total-lines* percentage)))))

(defun process-chunk (chunk)
  "Process a chunk of lines"
  (loop for (number . text) in chunk
        when number
          append (prog1 (process-sentence text number)
                   (update-progress))))

(defun read-input-file (filename)
  "Read the input file and return a list of (number . text) pairs"
  (with-open-file (in filename)
    (loop for line = (read-line in nil nil)
          while line
          collect (split-csv-line line))))

(defun count-lines (filename)
  "Count the total number of lines in the file"
  (with-open-file (in filename)
    (loop for line = (read-line in nil nil)
          while line count 1)))

(defun process-file-parallel (input-filename output-filename)
  "Process the input file in parallel and write results to the output CSV file"
  (let* ((input-data (read-input-file input-filename))
         (chunks (loop for i from 0 below (length input-data) by *chunk-size*
                       collect (subseq input-data i (min (+ i *chunk-size*) (length input-data)))))
         (results (pmap 'list #'process-chunk chunks)))
    (with-open-file (out output-filename
                         :direction :output
                         :if-exists :supersede
                         :if-does-not-exist :create)
      (cl-csv:write-csv-row '("Line Number" "Word" "Reading") :stream out)
      (dolist (chunk-result results)
        (dolist (row chunk-result)
          (cl-csv:write-csv-row row :stream out))))))

(defun main (filename)
  "Main function to process a file and write results to output.csv"
  (let* ((directory (directory-namestring filename))
         (output-filename (merge-pathnames "output.csv" directory)))
    (format t "Processing file: ~A~%" filename)
    (format t "Output will be written to: ~A~%~%" output-filename)
    (setf *line-counter* 0)
    (setf *total-lines* (count-lines filename))
    (format t "Total lines to process: ~D~%~%" *total-lines*)
    (time
     (handler-case
         (process-file-parallel filename output-filename)
       (error (e)
         (format t "An error occurred: ~A~%" e))))
    (format t "~%Processing complete. Total lines processed: ~D~%" *line-counter*)
    (format t "Results written to ~A~%" output-filename)))

;; Example usage:
;; (ichiran-file-processor:main "path/to/your/input.csv")

Note that you'll want to adjust the number of cores (currently 8) to match your specific machine. I'm sure there's a way to automatically detect it in Lisp, but I didn't want to spend more than 60 seconds trying to figure it out since I already knew the number for my machine.
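
For reference, one way to detect it at runtime is to ask the OS (my own sketch, not part of the original snippet; it shells out via UIOP, so it assumes a Unix-like system with getconf):

(defun detect-core-count ()
  "Sketch: ask the OS for the number of online processors; fall back to 8."
  (handler-case
      (parse-integer
       (uiop:run-program '("getconf" "_NPROCESSORS_ONLN")
                         :output '(:string :stripped t)))
    (error () 8)))

;; (setf lparallel:*kernel* (lparallel:make-kernel (detect-core-count)))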

You'll want to adjust the chunk size as well. Ask an AI what's good for your machine given your amount of RAM, the number of lines you're parsing, and your CPU core count. Right now the chunk size is set to 10,000. If you want to be optimal, divide the number of lines in your file by your core count, since each worker handles one chunk at a time; just don't make the chunk size so big that chunk size times core count exceeds your RAM capacity. I picked a round number here (10,000, which is definitely a big underestimate for my file), so you'll want to tweak it for your use case.

*From what I've gathered, each core processes one chunk at a time (maybe two with threads, I'm not sure), and when it finishes it picks up the next chunk. With multiple cores, several chunks are processed at once. You don't have to match the chunk count exactly to the core count; pick something reasonable and it will loop through all your chunks until everything is parsed. Just don't pick something too big for small files, or a few cores will get all the work and the rest will have nothing to do.
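
As a concrete version of that rule of thumb (my own sketch; the helper is hypothetical and not part of the code above):

(defun suggested-chunk-size (total-lines workers)
  "Divide the line count by the worker count so each worker gets roughly one chunk."
  (max 1 (ceiling total-lines workers)))

;; (suggested-chunk-size 150000 8) ; => 18750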

Speed

I ran this on a 150,000-row csv file containing Japanese subs from all Hunter x Hunter and Inuyasha episodes, where each row was a single line from the subs. It took just under an hour to complete on an AMD Ryzen 9 5900HX CPU. That works out to about 41.67 lines per second.

If we assume each line contains an average of 6 words (rough guess), that's 250 words per second.

If we assume each word is an average of 4 characters (rough guess), that's 1000 characters per second.

Comparing to the chart from @wolfv506: I parsed 3,600,000 characters (150,000 × 6 × 4) in 3,600 seconds (1 hour). 3,600 / 3,600,000 = 0.001 seconds of runtime per character, which is 10x better than 0.01.

*Granted, a portion of that is because I likely have a faster CPU, but I doubt my CPU is 10x faster.

Given that I'm just parsing single lines at a time, this could be made even faster by processing larger blocks of text at once in ichiran like @wolfv506 recommended.

Also, I might not be comparing apples to apples here since I'm just doing text segmentation. Others might've been using the -f or -i flags, which do dictionary lookups. I tested parsing "with info" and it took about 4x longer.

Notes

I hope this helps! :)