techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

Special support of tech.ml.dataset to store "text corpora" efficiently ? #102

Closed behrica closed 4 years ago

behrica commented 4 years ago

I am wondering if the abstractions present in tech.ml and tech.datatype would allow a specific form of "column type" in tech.ml.dataset which stores larger texts very efficiently (and not as java.lang.String). A bit of background for my use case:

I deal with NLP, and often I have large tables (millions of rows) in which one column is "long text". It can be up to thousands of words per table cell.

| int a | int b | ... | large text |
|-------|-------|-----|------------|
| 2     | 4     | ... | this is long text |
| 3     | 5     | ... | this is another long text |

This would be far more efficient to store, using a vocabulary :

| id | word    |
|----|---------|
| 0  | this    |
| 1  | is      |
| 2  | long    |
| 3  | text    |
| 4  | another |

and then

| int a | int b | ... | large text |
|-------|-------|-----|------------|
| 2     | 4     | ... | 0,1,2,3 |
| 3     | 5     | ... | 0,1,4,2,3 |

I would imagine that it would drastically reduce the space needed to store a text corpus, maybe 90% less (of course at the price of "slower" reads and writes to the table).

Ideally this should be "transparently handled" by the "table object".

So reading such a specific "large text" column would reconstruct the String and give me the string back, and writing to it would re-encode the strings as arrays of ints, possibly enlarging / shrinking the vocabulary.

I would see the "vocabulary" as "attached" to a table, but maybe it should be possible to share the same vocabulary between different tables.

Implementing this well could facilitate the usage of Clojure (and tech.ml.dataset ) for various NLP tasks.

My key question here would be whether the existing "abstractions" in the tech.ml universe would allow implementing this "completely transparently".
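
To make the idea concrete, here is a minimal sketch of the vocabulary encoding described above, in plain Clojure (the helpers build-vocab, encode-text and decode-ids are hypothetical, not part of tech.ml.dataset):

(require '[clojure.string :as str])

;; Hypothetical helpers illustrating the vocabulary idea; not tech.ml.dataset API.
(defn build-vocab [texts]
  (let [tokens (distinct (mapcat #(str/split % #"\s+") texts))]
    {:id->word (vec tokens)
     :word->id (zipmap tokens (range))}))

(defn encode-text [{:keys [word->id]} text]
  (int-array (map word->id (str/split text #"\s+"))))

(defn decode-ids [{:keys [id->word]} ids]
  (str/join " " (map id->word ids)))

(let [vocab (build-vocab ["this is long text" "this is another long text"])]
  (decode-ids vocab (encode-text vocab "this is another long text")))
;; => "this is another long text"

Storing one small int per token plus a shared vocabulary, instead of a full Java String per cell, is where the savings would come from.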

behrica commented 4 years ago

I just read that you support "strings are loaded into string tables".

I have seen this before, and just want to underline that this is not the same as my question. I suppose you "encode" cells with the "same" text into ints. This saves space for categorical variables.

This does not help for my use case, as every text cell is different, so the "string table" will be as large as the text data itself.

behrica commented 4 years ago

I think that spaCy implements something like this internally: https://spacy.io/api/vocab

It has the concepts of Vocab and StringStore:

> The central data structures in spaCy are the Doc and the Vocab. The Doc object owns the sequence of tokens and all their annotations. The Vocab object owns a set of look-up tables that make common information available across documents. By centralizing strings, word vectors and lexical attributes, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.
>
> Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.


Of course, spaCy is a library specifically for NLP, so it could be argued that somebody should implement a clone of spaCy in Clojure outside "tech.ml".

But maybe the "underlying data structure" of efficient storage of large texts could still have a place here.

Just let me know, what you think.

cnuernber commented 4 years ago

I think it is totally reasonable to have a better text storage mechanism specifically designed for paragraphs of text. Currently, if a string is longer than a certain amount, the system falls back to 'text', which currently simply means the column stores a string.

For text specifically we could split on spaces and insert the resulting strings into the string table and then have an offset and length recorded for each row of the column data. I think this might get you the space savings; what do you think? There is no explicit text datatype yet, so it really depends on how sophisticated the underlying support needs to be in the dataset mechanism.
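
A rough sketch of that layout: one token table per column, a flat int buffer of token ids, and an [offset length] pair per row (all names below are illustrative, not existing tech.ml.dataset internals):

(require '[clojure.string :as str])

(defn tokenized-column [texts]
  (let [token->id (atom {})
        intern!   (fn [tok]
                    (or (@token->id tok)
                        (let [id (count @token->id)]
                          (swap! token->id assoc tok id)
                          id)))
        ids       (java.util.ArrayList.)
        spans     (mapv (fn [text]
                          (let [toks  (str/split text #"\s+")
                                start (.size ids)]
                            (doseq [t toks] (.add ids (intern! t)))
                            [start (count toks)]))   ;; [offset length] per row
                        texts)]
    {:token-table (mapv key (sort-by val @token->id))   ;; id -> token
     :token-ids   (int-array ids)                       ;; all rows, flattened
     :spans       spans}))

(defn row->text [{:keys [token-table token-ids spans]} row]
  (let [[off len] (spans row)]
    (str/join " " (map #(token-table (aget token-ids %))
                       (range off (+ off len))))))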

Aside from storing the strings efficiently, are there a small set of core operations that would be expected in the dataset library for manipulating strings that are text specific or is just having a column of text objects that have efficient conversion to sequences of strings with an implied space in between sufficient?

Also, do you have an example canonical dataset?

It seems like there could be quite a lot here so I am hoping for a very very small first step. Maybe just text objects with a shared string table per column so you can load the data efficiently is fine and you can take it from there.

cnuernber commented 4 years ago

As an aside, do you know how spaCy compares to openNLP?

It seems a large jump up from both would be gluonnlp from the mxnet project.

I know nothing really about this space; just reading around. I am a fan of the gluonnlp project.

behrica commented 4 years ago

You already mentioned the right concepts to think about:

  1. We would talk about a "special column type" for a certain type of String, namely strings which consist of sequences of tokens separated by one or more separators
  2. These strings typically represent "human-generated text", where the separator between tokens is (mostly) a space
  3. The process of "tokenisation" should ideally be configurable, as it is problem-domain dependent (see the sketch after this list)
  4. Yes, there should be a string table somewhere and then pointers into it. The table could be a map (int -> String) or even one large string plus pointer + length per entry
  5. The functions for reading data from CSV should ideally do this transformation while reading the CSV files. This would allow accessing "large" text files with less memory pressure
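
For point 3, a minimal sketch of a configurable tokenizer: just a plain function threaded through an options map (the :tokenizer-fn key below is hypothetical, not an existing tech.ml.dataset option):

(require '[clojure.string :as str])

;; Two interchangeable tokenizers; callers pick whatever fits their problem domain.
(defn whitespace-tokenizer [s] (str/split s #"\s+"))
(defn word-tokenizer [s] (re-seq #"[\p{L}\p{N}]+" (str/lower-case s)))

;; A text-column parsing step could simply accept the tokenizer in its options.
(defn parse-text-cell
  [{:keys [tokenizer-fn] :or {tokenizer-fn whitespace-tokenizer}} cell]
  (tokenizer-fn cell))

(parse-text-cell {:tokenizer-fn word-tokenizer} "This is another long text.")
;; => ("this" "is" "another" "long" "text")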

I would not expect specific operations on these "human text columns". We should concentrate on efficient storage and the transparent transformation to and from java.lang.String while going in and out of the dataset.

This goes a bit in the direction of the special handling of date/time you already do.

cnuernber commented 4 years ago

That is pretty clear and should be doable. Getting a good storage format and loading a few large datasets is a solid first step; we can reevaluate after that.

cnuernber commented 4 years ago

OK, working through this. Simply splitting on spaces and storing the resulting hashtable for that column increases the dataset size by a factor of 4+.

Simply loading the column increases the dataset size by a factor of over 2 which I guess is the result of encoding the strings in utf-16 as opposed to utf-8.

Potentially a factor of 2 is doable but if you want a similar/smaller dataset size we would have to encode the data in UTF-8 or have a better tokenizer that did proper stemming, punctuation removal and lowercase the result.

The quick path is for text to just store the string with no string table. Then I believe you get only a factor of 2 larger dataset than the original data.

There could be further support for UTF-8 text, or perhaps text should always be stored this way, but those are getting to be larger changes with ramifications for performance and interactions with hashtables and such.
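
A quick way to see the factor-of-2 at the REPL is to compare a string's UTF-16 char data with its UTF-8 byte count (a sketch; real heap sizes also include object headers and, on Java 9+, compact-string behaviour for Latin-1 text):

(let [s "OBJECTIVE: This retrospective chart review describes the epidemiology ..."]
  {:chars        (count s)
   :utf-16-bytes (* 2 (count s))                      ;; char[] backing, pre compact strings
   :utf-8-bytes  (alength (.getBytes s "UTF-8"))})    ;; ~1 byte per char for ASCII-heavy text
;; => the UTF-16 figure is roughly double the UTF-8 one for this kind of text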

cnuernber commented 4 years ago

Encoding the abstract in UTF8 reduces the loaded size from 450KB to 350KB. This is still twice as large as the uncompressed file on disk. Tough problem for java, really. The rest of the string columns together don't amount to that much so tracking this down and actually getting to 1->1 in-memory to on-disk size is tough for text corpora without more effort.

Fixing the bug for keeping the column as a string column and loading it as a string with a string table really doesn't increase the file size much beyond loading the column as utf-16 java strings, 460KB as compared to 450KB.

behrica commented 4 years ago

Maybe the problem is too context specific, and a "default" tokenization does not seem to be very efficient.

behrica commented 4 years ago

I am nevertheless wondering how I can do "something" when I have, let's say, a 2 GB CSV file in which one of the columns is large texts, and I want to use tech.ml.dataset for exploratory data analysis.

Currently I have 2 options, assuming that loading the whole file will kill my memory.

  1. I can load some rows only (using :num-rows) - good
  2. I can skip the text column - good

    Maybe a form of "sampling" of rows while reading the file would be nice (instead of taking the "head" as the sample); a sketch of this follows below.
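
A sketch of the sampling idea from the last point, using reservoir sampling over lazily parsed rows (assuming tech.ml.dataset.parse/csv->rows, which appears later in this thread, streams rows lazily):

(require '[tech.ml.dataset.parse :as ds-parse])

(defn reservoir-sample
  "Keep k uniformly sampled items from a (possibly huge) lazy seq."
  [k items]
  (:sample
   (reduce (fn [{:keys [n sample]} item]
             (cond
               (< n k)                  {:n (inc n) :sample (conj sample item)}
               (< (rand-int (inc n)) k) {:n (inc n) :sample (assoc sample (rand-int k) item)}
               :else                    {:n (inc n) :sample sample}))
           {:n 0 :sample []}
           items)))

;; e.g. 1000 random abstracts without holding the whole file in memory;
;; rest drops the header row returned by csv->rows.
(def sampled-rows
  (reservoir-sample 1000
                    (rest (ds-parse/csv->rows "./data/2020-06-29/metadata.csv"
                                              {:column-whitelist ["abstract"]}))))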

This allows me to play with the data.

At a certain moment I played enough, maybe even wrote some code to analyze my sample texts.

But now I want to run it against the full data file. What options do I have now in tech.ml.dataset?

Loading the full file will fail, so I am a bit stuck.

At this point in time, I would maybe "know" precisely what form of "text representation" I need.

But I cannot implement this "text transformation" on the full data set, even if it would drastically reduce the information (and the needed memory). Example: I decide to only keep the 10000 most frequent tokens, so I go to a bag-of-words representation.

Ideally I would be able to "reduce" the needed memory for the text while I read it in. Maybe this could be achieved by a kind of "callback" mechanism, which calls into my code while it reads the file.

I first thought that the options in :parser-fn allow exactly this, but they don't.

So maybe it would be useful to have a way to register a custom parse function which gets called (if given) with the text of the current CSV cell.

In this function I could then decide what I do with the text. Whatever this function returns will be stored in the dataset column.

behrica commented 4 years ago

In some scenarios I can work with the large file in "blocks"; this would require not only a :num-rows option but also a :start-row option.

behrica commented 4 years ago

I am just seeing that "the callback" I mentioned before is already there. So I can use the current features of tech.ml.dataset to convert a rather large CSV into a bag-of-words. The following is the typical bag-of-words construction which I would do.

This should work on rather large datasets, as in both passes over the CSV I don't keep the full text. The first pass just creates a token frequency table, which is used in the second pass to encode each text as a bag-of-words using ints.


(ns cord19.test
  (:require  [clojure.test :as t]
             [tech.ml.dataset :as ds]
             [tech.ml.dataset.parse :as parse]
             [tech.v2.datatype :as dtype]
             )
  (:import [tech.ml.dataset.parse PColumnParser]
            [smile.nlp.tokenizer SimpleTokenizer]
           )
  )
(defn sha256 [string]
  (let [digest (.digest (java.security.MessageDigest/getInstance "SHA-256") (.getBytes string "UTF-8"))]
    (apply str (map (partial format "%02x") digest))))

(defonce tokenizer (SimpleTokenizer.))

;;  go once over dataset to create token->id table

(def all-freqs* (atom {}))   ;; token -> frequency across all rows
(def all-shas* (atom (dtype/make-container :list :string  0)))

(defn f-count-freqs [column-name-or-idx column-data]
  (reify PColumnParser
    (parse! [this str-value]
      (let [tokens (seq (.split tokenizer str-value))]
        (swap! all-freqs* #(merge-with + (frequencies tokens) %))
        (.add ^java.util.List @all-shas* (sha256 str-value))))
    (missing! [parser])
    (column-data [parser]
      {:data @all-shas*
       :missing []})))

(def dummy
  (ds/->dataset "./data/2020-06-29/metadata.csv" { :num-rows 10
                                                  :column-whitelist ["abstract"]
                                                  :parser-fn {"abstract" f-count-freqs}
                                                  :max-chars-per-column 1000000}))

;;
;;
;; keep most frequent tokens

(def most-freq-words
  (take 1000
        (reverse
         (sort-by second
                  @all-freqs*))))

;;  go second time over dataset and use look up table to store indexes only

(def token-lookup-table (zipmap
                          (keys most-freq-words)
                          (range)
                          ))

(def sha-ids-map* (atom {}))   ;; sha of text -> set of kept token ids

(def all-shas* (atom (dtype/make-container :list :string  0)))   ;; reset for the second pass

(defn f-tokenize [column-name-or-idx column-data]
  (reify PColumnParser
    (parse! [this str-value]
      (let [tokens (seq (.split tokenizer str-value))
            ids    (into (hash-set) (remove nil? (map token-lookup-table tokens)))
            sha    (sha256 str-value)]
        (.add ^java.util.List @all-shas* sha)
        (swap! sha-ids-map* assoc sha ids)))
    (missing! [parser])
    (column-data [parser]
      {:data @all-shas*
       :missing []})))

(def ds
  (ds/->dataset "./data/2020-06-29/metadata.csv" { :num-rows 10
                                                  :parser-fn {"abstract" f-tokenize}  :max-chars-per-column 1000000}))

(def bag-of-words
  @sha-ids-map*)

(def bag-of-words-ds
  (ds/->dataset
   (mapcat
    (fn [[document token-ids]]
      (map (fn [token-id]
             {:document document
              :token-id token-id})
           token-ids))
    bag-of-words)))

behrica commented 4 years ago

Maybe String compression could work: https://github.com/lithedream/lithestring

cnuernber commented 4 years ago

That is pretty interesting. So you want one pass over just a few columns to generate your frequencies table and then another pass to keep just the highest frequency items.

You keep track of the sha-id and map the sha-id to the set of tokens you keep. Your final dataset is a flattened dataset where document is the sha-hash of the input and token-id is an index into a table of string tokens.

So really each abstract becomes a sparse vector into a token space.

One option is that you could stay in string space completely and bypass all of the parsing mechanisms. This also keeps things lazy at this point.

user> (require '[tech.ml.dataset.parse :as ds-parse])
nil
user> (take 10 (ds-parse/csv->rows "test/data/medical-text.csv" 
                                   {:column-whitelist ["abstract"]}))
(["abstract"]
 ["OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infection ..."]

The use of the smile tokenizer was a very good move. I love Haifeng's work.

Looking at everything you did I might parse the document once with the normal dataset mechanisms to make the master token table and generate the sha hashes in place of the abstracts. You can then do a group-by-column operation to get a reverse map from sha-hash to rows of the table that map to that hash.

For the second pass I would use the technique above and use a pmap operation to re-tokenize the data and just keep the tokens you care about in a map from sha hash to token set like what you have.

Also, this is hard to explain in documentation but you can specify a custom parse function and a column datatype together. This means your parse function is simpler than an implementation of PColumnParser as it really can just be a normal clojure function. From this function you can return either a value of the datatype specified in the column datatype or the special keywords :tech.ml.dataset.parse/missing or :tech.ml.dataset.parse/parse-failure. The difference between missing and parse-failure is that parse failures are recorded in the column metadata as well as in the missing data. This allows you to see failing values after-the-fact.
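
A sketch of that form, based on the description above (the [:datatype parse-fn] tuple and the two special keywords are as described; the column, path and cleanup logic are only illustrative):

(require '[clojure.string :as str]
         '[tech.ml.dataset :as ds])

(def cleaned
  (ds/->dataset "./data/2020-06-29/metadata.csv"
                {:column-whitelist ["abstract"]
                 :max-chars-per-column 1000000
                 :parser-fn
                 {"abstract"
                  [:string (fn [^String cell]
                             (cond
                               ;; empty cell -> recorded as missing
                               (str/blank? cell)   :tech.ml.dataset.parse/missing
                               ;; implausibly short "abstract" -> recorded as a parse failure
                               (< (count cell) 10) :tech.ml.dataset.parse/parse-failure
                               :else               (str/lower-case cell)))]}}))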

I added a namespace showing a potential pathway to a full bag of words where you get a dataset with sha-hashes in a particular column and a master token table. You can then figure out your vocabulary by trimming and manipulating the master token table, and then there is a second step to, given a token->integer map, re-parse the file using only the low-level CSV mechanism and produce your final document->token-idx table. From there you can group-by-column on either column to get maps back and forth.

Here is a proposed pathway to build a bag of words using concepts from above:

https://github.com/techascent/tech.ml.dataset/blob/f57fe86741fd0c0158a643c946404b78bd09c90a/src/tech/ml/dataset/text/bag_of_words.clj#L89

cnuernber commented 4 years ago

I guess from there you are doing sparse math to find cosine or regular distance. My recommendation here is to use something like ojAlgo or one of the libraries indicated here:

https://java-matrix.org/

ojAlgo reportedly has excellent performance while the person who built the java-matrix page publishes ujmp.
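
For the sparse math itself, a plain-Clojure sketch (deliberately not using ojAlgo or ujmp) of cosine similarity between two bag-of-words vectors represented as maps of token-id -> count:

(defn dot [a b]
  (reduce-kv (fn [acc id cnt] (+ acc (* cnt (get b id 0)))) 0.0 a))

(defn cosine [a b]
  (let [denom (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))]
    (if (zero? denom) 0.0 (/ (dot a b) denom))))

(cosine {0 2, 3 1, 7 1} {0 1, 3 1, 9 4})
;; => ~0.289

For serious sizes a dedicated sparse-matrix library is the better choice, which is what the ojAlgo / ujmp pointers above are about.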

behrica commented 4 years ago

Thanks for your comments and examples, very useful. I think the existing features in tech.ml are currently "right", and I think it is ready to be used for NLP.

Maybe a few little things to think about and then handle as individual issues:

  1. start reading the file from a given "starting row". This would allow working on larger-than-memory files in blocks of lines
  2. a "sample" function, which reads x lines at random positions
  3. try whether any form of lossless compression of string columns can save memory (https://github.com/lithedream/lithestring) and, if yes, allow reading texts in this form

Then we can close this general issue and discussion from my point of view, if you agree.

cnuernber commented 4 years ago

1 and 2 are totally reasonable; I am a fan.

3 - This seems not as necessary just now. It isn't a large jump past encoded files, however, in terms of complexity. I think we should split that off into a separate issue - investigate string compression for large text datasets.

behrica commented 4 years ago

Ok, I will create issues for 1) and 2).

3) still comes from the "default" behavior of R, which in this regard is far superior to Java/Clojure:

I can read a 12 GB CSV file from disk into R in about 30 seconds without any issue. I have only 16 GB of RAM.

> df=readr::read_csv("big.csv")
Parsed with column specification:
cols(
  cord_uid = col_character(),
  sha = col_character(),
  source_x = col_character(),
  title = col_character(),
  doi = col_character(),
  pmcid = col_character(),
  pubmed_id = col_double(),
  license = col_character(),
  abstract = col_character(),
  publish_time = col_character(),
  authors = col_character(),
  journal = col_character(),
  mag_id = col_logical(),
  who_covidence_id = col_logical(),
  arxiv_id = col_logical(),
  pdf_json_files = col_character(),
  pmc_json_files = col_character(),
  url = col_character(),
  s2_id = col_logical()
)
|================================================================| 100% 11475 MB

R then tells me that in RAM it is only 1.76 GB.

Can we do this in Java / Clojure ?

Java 9+ now has "compact strings", so at least for Latin characters it should somehow be possible.

cnuernber commented 4 years ago

That is impressive for sure. Right now we are at least doubling the size of the file even with UTF-8 data, although that is probably because I didn't join the strings together into one buffer. Maybe that is something to look at: both encoding the strings differently and storing them in a particular way.

It is certainly possible but we need to figure out how exactly R does it. This will be more efficient than taking sort of random shots in the dark. This is with the R package data.table?

cnuernber commented 4 years ago

No string encoding is going to get 12-1 compression so there is something else going on there like a mmap'd file or something.

joinr commented 4 years ago

Just a thought re the text storage stuff and tokenizing, fastutil I think was built for that purpose originally and has a bunch of support for text storage and manipulation (including alternatives to strings). May be worth a look since its a fundamental dependency.

behrica commented 4 years ago

> Just a thought re the text storage stuff and tokenizing, fastutil I think was built for that purpose originally and has a bunch of support for text storage and manipulation (including alternatives to strings). May be worth a look since its a fundamental dependency.

Yes, this looks very promising. It allows using "byte arrays" to store the long strings.

I am currently evaluating some code which uses the usual CSV parser to parse the rows and then stores the bytes of the "long String" (per row) in one "huge byte array" (using ByteBigArrayBigList) plus the indexes where each String starts.

cnuernber commented 4 years ago

I believe the only way to load that entire file (12+GB) on your laptop is to mmap and parse the file and your 'strings' are just pointers into the file. This is similar to your ByteBigArrayBigList approach except the file itself is your 'big list'. I don't think there exists a good mmap-based csv parser for java but that is the only thing, IMO, that will explain R loading the file and taking 1GB of ram. It stored 1GB of long offsets (or even chunked it so it stored a few long offsets and then 1GB of integer offsets) plus any primitives it loaded into arrays or something like that.
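
A minimal sketch of the "strings are pointers into the file" idea with plain java.nio memory mapping (a single MappedByteBuffer is limited to 2 GB, so a real implementation would chunk the file; the offsets and column layout would have to come from the CSV parse):

(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode]
        '[java.nio.charset StandardCharsets])

(defn mmap-file
  "Map a file (< 2 GB per buffer) read-only; bigger files need several buffers."
  [^String path]
  (let [raf (RandomAccessFile. path "r")
        buf (.map (.getChannel raf) FileChannel$MapMode/READ_ONLY 0 (.length raf))]
    (.close raf)   ;; the mapping stays valid after the channel is closed
    buf))

(defn read-span
  "Decode the bytes at [offset, offset+len) of the mapped buffer as UTF-8."
  [^java.nio.MappedByteBuffer buf ^long offset ^long len]
  (let [bytes (byte-array len)
        dup   (.duplicate buf)]   ;; duplicate so the shared buffer's position is untouched
    (.position dup (int offset))
    (.get dup bytes)
    (String. bytes StandardCharsets/UTF_8)))

;; A "string cell" is then just {:offset o :length l} resolved against the mapped buffer.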

behrica commented 4 years ago

I got it cleanly working with fastutil:

;; Assumed requires/imports for this snippet (not shown in the original comment):
;;   [tech.ml.dataset.parse :as ds-parse]
;;   [clj-memory-meter.core :as mm]
;;   (:import [it.unimi.dsi.fastutil.bytes ByteBigArrayBigList ByteArrayList]
;;            [it.unimi.dsi.fastutil.longs LongArrayList])

(defmacro when-let*
    ([bindings & body]
     (if (seq bindings)
       `(when-let [~(first bindings) ~(second bindings)]
          (when-let* ~(drop 2 bindings) ~@body))
       `(do ~@body))))

  (def iterable
    (ds-parse/raw-row-iterable
     "/home/carsten/tmp/big.csv" ;; 12 GB, the text column alone is 8 GB
     (ds-parse/create-csv-parser {:max-chars-per-column 1000000})))

  (def iterator (.iterator iterable))
  (def current-index (atom 0))
  (def bl (ByteBigArrayBigList.))
  (def index-list (LongArrayList.))
  (time
   (while (.hasNext iterator)
     (when-let* [next-string (aget (.next iterator) 8)   ;; column 8 = the abstract text
                 bytes (.getBytes next-string "UTF-8")
                 len (count bytes)]
       (.add index-list @current-index)
       (swap! current-index + len)
       (.addAll bl (ByteArrayList. bytes)))))
  ;; "Elapsed time: 219876.175441 msecs"
  (.size64 bl)
  ;; => 8310388458
  (mm/measure bl)
  ;; => "8.5 GB"
  ;; (.size index-list)
  ;; => 6134401
  (mm/measure index-list)
  ;; => "46.9 MB"
  (.subList index-list 0 5)
  ;; => (0 8 1855 2870 4517)
  (String.
   (.toByteArray (.subList bl 8 1855))
   "UTF-8"
   )
  ;; => "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were the most common symptoms, and crepitations (60%), and wheezes (40%) were the most common signs. Most patients with pneumonia had crepitations (79.2%) but only 25% had bronchial breathing. Immunocompromised patients were more likely than non-immunocompromised patients to present with pneumonia (8/9 versus 16/31, P = 0.05). Of the 24 patients with pneumonia, 14 (58.3%) had uneventful recovery, 4 (16.7%) recovered following some complications, 3 (12.5%) died because of M pneumoniae infection, and 3 (12.5%) died due to underlying comorbidities. The 3 patients who died of M pneumoniae pneumonia had other comorbidities. CONCLUSION: our results were similar to published data except for the finding that infections were more common in infants and preschool children and that the mortality rate of pneumonia in patients with comorbidities was high."

We are still slower than R (37 s vs 220 s), but the memory usage is very good. It takes the same memory in heap space as on disk.

cnuernber commented 4 years ago

Hmmm, interesting...So then you have essentially an encoded string class that shares the backing store using a bigbytearraylist.

I think we will be slower than R for now; that is fine. Probably to get as fast as R we need to take over the CSV parsing which is a chore I don't think any of us is signing up for.

This does point to a solid pathway forward: for any strings we record, we can save them to one backing store. That, as you said, keeps us at most at the size of the file on disk, and less in the case where the string tables or conversion to primitive data saves space.

joinr commented 4 years ago

Curious if you couldn't use something like iota's mmapd line reducer to store the backing, then compute strings on demand, or rip out the guts from iota and use that with some custom parsing bits. hmm

behrica commented 4 years ago

> Hmmm, interesting...So then you have essentially an encoded string class that shares the backing store using a bigbytearraylist.
>
> I think we will be slower than R for now; that is fine. Probably to get as fast as R we need to take over the CSV parsing which is a chore I don't think any of us is signing up for.
>
> This does point a solid pathway forward, for any strings we record we can save them to one backing store. That as you said keeps us at most the size of the file on disk and less in the case where the string tables or conversion to primitive data saves space.

Yes, the general principle could work for all String columns, either by default or optionally. It also opens up nice control over "encoding" (I have used UTF-8 and, optionally, more specific encodings via "lithestring"). From my experiments I need less than one byte to store a character, roughly 0.6 bytes. This is of course text specific.

Is it possible that this behaviour could be "switched on" during parsing with something like the following, so that we have it "transparent" on the table?

So writing and reading to the table would do the correct thing.

(ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz"
              {:column-whitelist ["SalePrice" "1stFlrSF" "2ndFlrSF" "big-text-1" "big-text-2"]
               :n-records 5
               :parser-fn {"SalePrice" :float32
                           "big-text-1" [:fast-text :encoding "UTF-8"]
                           "big-text-2" [:fast-text :string-to-bytes-fn xxxx :bytes-to-string-fn yyyy]}})

cnuernber commented 4 years ago

UTF-8 encoding is 1 byte per character; I think for your experiments something is measuring off. The CSV includes commas and newlines, and the bigbytearray itself doesn't do compression on the data. If you take a single Java string, for example, and measure it first and then encode it, I believe you will see exactly 1/2 the size.

Regardless, getting text columns smaller is a large piece of the puzzle.

As for all string columns: string columns are stored in string tables, and this usually results in better compression, as lots of string columns have high levels of repetition, at least in some datasets, although not in your medical text. The combination of encoded data and string tables currently doesn't work great, although it does result in some compression; just not a lot. There is potential there.

The safest first move is to make sure that a column that is parsed as 'text' follows the above scheme; that nets a sizeable gain without much risk. In your case (medical data) it is more like saying 'don't use string tables; only use text' and then having one per-file text storage container, although I bet the difference in savings between that and a per-column storage container isn't much.

Controlling the encode/decode pathway per column is definitely a good idea; if you know something about your data you could absolutely use less than 1 byte per character.
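
A sketch of such a per-column encode/decode pathway, in the spirit of the hypothetical :string-to-bytes-fn / :bytes-to-string-fn options proposed above (none of these names exist in tech.ml.dataset):

;; A codec is just a pair of functions; UTF-8 shown, but a domain-specific codec
;; could pack a restricted alphabet into fewer than 8 bits per character.
(def utf8-codec
  {:string->bytes (fn [^String s] (.getBytes s "UTF-8"))
   :bytes->string (fn [^bytes bs] (String. bs "UTF-8"))})

(defn encode-column [codec strings]
  (mapv (:string->bytes codec) strings))

(defn decode-column [codec byte-arrays]
  (mapv (:bytes->string codec) byte-arrays))

(decode-column utf8-codec (encode-column utf8-codec ["first abstract" "second abstract"]))
;; => ["first abstract" "second abstract"]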

behrica commented 4 years ago

> UTF-8 encoding is 1byte per character, I think for your experiments something is measuring off. The csv includes commas and newlines and the bigbytearray itself doesn't do compression on the data. If you take a single java string, for example, and measure it first and then encode it I believe you will see exactly 1/2 the size.

I mean, the numbers fit together. The whole CSV is 12 GB; I read it in fully with the CSV parser but only keep a single column. This single column on disk has 8 335 921 109 bytes. The above parsing keeps it in 8 310 388 458 bytes on the heap (so it removes the commas and newlines). I use UTF-8, so one byte per character. Then we need to keep some data on the string separations; this is another 6134401 longs in my case.

So we use as little memory as possible (without going further into specific encodings using less than a byte per character).

I somehow think the user should be able to choose manually between the 4 "storage types for text":

- :tokenized-text (very new, right?)
- :string (simple-string-parser)
- :text (simple-text-parser)
- :fast-text (using the above pathway)

I am not sure if there can be a "heuristic" which would prefer "fast-text" automatically in certain situations. Probably the current defaults, which choose "string" and "text" automatically, are fine; "tokenized-text" and "fast-text" should be on request.

cnuernber commented 4 years ago

Why not just have text be fast-text?

behrica commented 4 years ago

> Why not just have text be fast-text?

I am not sure if performance would suffer. Each access to the "String" will do the byte->String conversion.

behrica commented 4 years ago

Not too bad, probably. I measured, for the 8 GB of text, counting and summing all string lengths:

(time 
  (reduce +
          (map
           #(count (String. (.toByteArray (.subList bl %1 %2))))
           (seq index-list)
           (rest (seq index-list)))))

-> 60 seconds

joinr commented 4 years ago

I was doing some naive exploration with memory mapping the text using iota and benching against t.m.d. I did a naive indexed version using LongArrayList and an index-less variant. Some of this was a learning project to mess with defining readers from tech.datatype as well. I used a relatively tiny 195mb sample data file to get a sense of scaling and access times.

(screenshot: "iotatest" benchmark results)

Looks like the memory mapped approach (at least what I implemented on top of iota, with its limitations), without any smarter compression like t.m.d. does for text, gets you about 50% space savings even on the smallish dataset, at about 3x the cost to access elements. This is on a decent SSD though, no idea what performance is like on other i/o setups. I was expecting the performance difference to be substantial, although I didn't do gobs of random accesses or linear traversals. The implementation is somewhat limited due to iota's setup, which is to index by lines, so I was stuck with a row-centric processing model as opposed to columns. Still, you can cache the row and share amongst readers so it kind of works out if you're traversing rows, and you fall back to recomputing column entries from line splits many times otherwise.

joinr commented 4 years ago

Is your large text corpus public @behrica , and if not, do you know of any that are?

cnuernber commented 4 years ago

Nice, now we are talking. Potentially we can build a mmap representation (isn't that supposed to be Arrow?) during parse time and parse/work with crazy huge files.

The OS should take care of making sure the mmap data is in memory efficiently as long as we access it reasonably. Really interesting - and pretty fun :-).

behrica commented 4 years ago

Memory-mapped file buffers in Java are limited to 2 GB, and so is iota. This can be circumvented by using several mmap buffers and doing chunking. I found a Clojure library doing that:

Clj-mmap

behrica commented 4 years ago

> Memory-mapped file buffers in Java are limited to 2 GB, and so is iota. This can be circumvented by using several mmap buffers and doing chunking. I found a Clojure library doing that:
>
> Clj-mmap

I just saw that iota does this as well. So it can probably handle arbitrarily large files. I will try your test with my 12 GB file.

behrica commented 4 years ago

> Is your large text corpus public @behrica , and if not, do you know of any that are?

Mine is from here: https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-07-07.tar.gz

Inside is metadata.csv, and I replicated it 50 times to get a 12 GB CSV.

I could use the above data source and make a nice large CSV by including the "fulltext" of 70 000 papers in it (currently they are separate; the metadata.csv contains "abstracts" only). This would then be around 40 GB uncompressed.

Potentially I could share this (on zenodo.org ), as the data is public.

So we could make a blog post using tech.ml.dataset to work on Covid-19 data....

joinr commented 4 years ago

> I just saw that iota does this as well.

Yea, I think the original design spec was to handle terabyte sized files for banking data. The only downside is that they make some assumptions about the input regarding encoding; I'm unsure how much of a challenge it would be to generalize (or if it would be acceptable to stick with utf-8 and single byte delimiters).

Thanks for the link, I'll play around with it too.

behrica commented 4 years ago

>> I just saw that iota does this as well.
>
> Yea, I think the original design spec was to handle terabyte sized files for banking data. The only downside is that they make some assumptions about the input regarding encoding; I'm unsure how much of a challenge it would be to generalize (or if it would be acceptable to stick with utf-8 and single byte delimiters).

I tried with UTF-16 and it failed. But by changing the encoding here https://github.com/behrica/iota/blob/master/src/java/iota/FileVector.java#L124 and in one more place to UTF-16, it worked.

So maybe just having a way to configure the encoding would work.

behrica commented 4 years ago

@joinr If I understand your approach correctly, you would initially "parse" a CSV file into a (possibly nested) list of positions. A position is just the "index" where a field starts in the mmapped file. In the optimal case, we would just require one "java long" per "cell" of the CSV. My 12 GB example file has 8391101 rows x 22 columns of cells, which would require (assuming 64 bits per cell):

(float (/ (* 8391101 22 8) 1024 1024 1024 ))

1.4 GB of heap. That's a very good move...

The ratio of 12 GB to 1.4 GB is because my file has "only" one "rather short" text column. The more long text the CSV file has, the larger the improvement.

The idea is brilliant.

Some considerations:

  1. I don't think there is an existing full-featured CSV parser which returns the needed field positions. I googled a bit for "memory mapped CSV", and some people have done something in this direction: https://github.com/dw/csvmonkey

  2. It would work very badly for certain types of CSV (numeric fields only), so it should be an optional parsing approach (only for text)

  3. iota currently does not work with UTF-16

Nevertheless it seems feasible and could make a big difference in working with "text heavy" CSV files.

joinr commented 4 years ago

@behrica

Actually, there are two approaches. One is using iota to mmap the file (you get random access to lines/rows very fast). Then your problem is getting the substring relative to the field you care about. One way is to pre-parse everything, build up all the indices, and store them in an efficient way. So you now have multiple "columns" that provide fast access to the string via iota and have the offsets precomputed; you can look them up efficiently with subsequence on demand. This is cool, but as I saw, you need to cater your indexing scheme to the data (e.g. several million longs in a LongArrayList are still relatively "heavy", especially if you only need a hundred-K or less address space). So you could do the stuff that t.m.d. does and have the indices stored in a promotable type that fits the address space (additionally, you could do canonicalization if you'd like, although that is more involved...).

The other option just says "we don't even need indexing, we'll just compute everything on demand". So in this case, you again leverage iota to get you the nth line/row from the file fast, then on read, compute the offsets and grab the subsequence that you care about. This approach trades computation time for space, since it's just a memory-mapped handle over the file really. No indexing required, and it's still surprisingly fast.

I think you could probably build additional optimizations on top of these if you can further tokenize the subsequences, but I went the quick and relatively simple route. One optimization I added was to have a shared cursor for the recently accessed line. This is to prevent us from having to re-fetch it for every column, so we can take advantage of already having paged it in.

There are likely way better implementations, but for a naive off-heap paged text database, it's not too bad.
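
A tiny sketch of the index-less variant: given fast random access to a raw line (which is what iota provides over the mmapped file), the cell is re-derived on every access instead of being indexed up front. The naive comma split below ignores quoting, so a real version would need a proper CSV field scanner:

(require '[clojure.string :as str])

;; No per-cell index: a cell read is "fetch line, split, take nth", recomputed each time.
(defn nth-field [^String line ^long col]
  (nth (str/split line #"," -1) col nil))

;; `lines` can be anything line-indexable, e.g. an iota vector over the mmapped file.
(defn cell [lines row col]
  (nth-field (nth lines row) col))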

behrica commented 4 years ago

The 2 explorations and the code we did above have 2 different goals:

1) Mine from above is a way to make tech.ml.dataset store its text columns in a different way than java.lang.String (namely as large on-heap byte arrays in the above example, but "off heap" could probably work similarly via direct allocation of ByteBuffers). This "is expected" to work even when the table gets changed, so the tech.ml code should somehow transform between the "string" and the "byte buffer" on any operation on the dataset (read / write). This is not directly related to CSV either; it should be doable even if I generate a 10 GB dataset via code. It can become very complex the moment we change the table (removing, adding, changing text).

2) The code of @joinr explores reading a CSV file differently: "during parsing of the CSV" it creates a list of pointers which in the end point to disk (via mmapped files). Then during "reading" of the table, we read from disk (probably cached by the OS) and transform the positions into Strings. This approach cannot work the moment we change the table, correct?

joinr commented 4 years ago

There's option 3. which is to just parse on-demand as needed, which is what the index-less variant does.

joinr commented 4 years ago

So long as you are sitting on top of these various readers, and define transforms (e.g. changes) on top of them that don't force everything to be evaluated and stored in memory, then I think you are okay. You would have to maintain some lineage to the original backing store and define which transforms (changes to the table) would require updating things. I approached this from the idea that the underlying input data from the CSV would not be changing. In the case that the CSV does change, and you have to recalculate indices, then the index-less variant that just parses on-demand, would probably be robust.

joinr commented 4 years ago

By "index-less" I mean, we don't precompute all the offsets for each line in the file. Instead, we just leverage iota's indexing scheme and compute substrings as needed when accessed. This is, in theory, the absolute lightest weight approach you can get, and should be bounded in size by the size of iota's indices for the line breaks.

joinr commented 4 years ago

I expect the byte-array approach from 1. would probably dominate for raw performance at the expense of space though.

joinr commented 4 years ago

After re-reading the comment, I like the classification of the general approaches: efficient storage of large on-heap text, or efficient access to large off-heap text.

behrica commented 4 years ago

One other view on this discussion is the hierarchy of user expectations on working with (large) data files

As a user of "tech.ml.dataset" working with a large CSV file, my expectations would be (in order of importance):

At the time of reading the file:

  1. However large the file is, just reading it in with default settings (so in full) should not crash my REPL. This happens now if the file is too big; I need to kill Java and restart the REPL.

     (1a. Long-running operations should show progress.)

  2. If the uncompressed file size on disk is less than my "free RAM", I should be able to "read it in fully" (ideally without playing with "-XmxYYY", but that would be acceptable).

  3. If the full file is clearly too large for my computer / JVM, I would expect to be able to "read it in partially" (first rows or sampled rows).

  4. If the full file is clearly too large for my computer, but a subset of columns is small enough, I would expect to be able to read in a few columns in full and to be able to work on them.

If some of these "expectations" are not met, there is a high risk that a user just gives up and does not use Clojure for his analysis.

Both approaches can potentially help drastically for all 4 expectations.

We need to see if they "just move the possible file size upwards" or are able to "fail nicely" in all cases (including a 100 GB file opened by mistake).

cnuernber commented 4 years ago

About #1 - don't you just get an OOM error?