It's the "experience", which is the problem.
This is not specific to tech.ml.dataset.
I have the same issue with large CSVs in other libraries (e.g. semantic-csv) which create sequences of maps. The JVM has difficulty handling "a large amount" of small and mid-size String objects, as occur during CSV parsing.
I really have the impression that, after a certain point, the JVM oscillates between garbage collecting and making actual progress.
I noticed large improvements in doing this: https://github.com/techascent/tech.ml.dataset/issues/102#issuecomment-654841560
If the file is "too big" for this code, I get an OOM very quickly, and afterwards the REPL stays responsive.
Because with this, only String objects for the "current line" are generated (and they are short-lived). All "large" text data sits in a single (or a few) large byte arrays.
So after the OOM it probably only needs to GC a few (very large) objects, which happens immediately.
I don't see any frenetic GC activity as with "normal String storage".
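A minimal sketch of that access pattern, using @thebusby's iota library (which comes up again later in this thread); the file name is a placeholder and the usage follows iota's README:

(require 'iota)

;; Memory-map the file: the bulk of the text lives in mapped buffers,
;; not in millions of small heap Strings.
(def backing (iota/vec "huge.csv"))

;; Only the line being inspected right now is materialized as a String,
;; and it becomes garbage again immediately afterwards.
(count (nth backing 12345))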
So long as you are sitting on top of these various readers, and define transforms (e.g. changes) on top of them that don't force everything to be evaluated and stored in memory, then I think you are okay. You would have to maintain some lineage to the original backing store and define which transforms (changes to the table) would require updating things. I approached this from the idea that the underlying input data from the CSV would not be changing. In the case that the CSV does change, and you have to recalculate indices, then the index-less variant that just parses on-demand, would probably be robust.
I did not mean to change the CSV file on disk. I was wondering what would happen if I added a row to a dataset which had been parsed as you did. This new text row would not have any backing in the mapped CSV; it would need to be stored as usual. So "the same column" would need to use different backends, depending on whether the row comes from the file or was added later.
It's not a very common use case, I agree, so maybe a kind of "read-only" column would be OK. Most of the time I add columns and do some form of filtering of rows.
I looked again at the code and see what you mean. There is no concept of an "appendable dataset". We have sequences of column data, so there is no changing.
I see now that the whole tech.ml.dataset abstraction has no concept of "change"; it is as immutable as most of Clojure. I had the impression that there was a certain form of "update" concept. This makes things easier ...
So some of my comments above do not make sense any more...
https://github.com/techascent/tech.datatype/blob/master/docs/01-operations.md
I think that may help. This is actually based on a completely new architecture for vectorized functional math that has never been tried by anyone I know of in history.
By vectorized I mean it follows the property that (+ a b) returns a reader if A or B is a reader. This is a form of polymorphism that does not exist in Clojure proper. This is more like numpy or APL.
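A small illustration of that property; the tech.v2.datatype.functional namespace and alias below are an assumption about the tech.datatype API of that era:

(require '[tech.v2.datatype.functional :as dfn])

(def a [1 2 3])
(def b [10 20 30])

;; Element-wise addition over container arguments yields a lazy reader,
;; not an eagerly realized collection:
(dfn/+ a b)     ;; => reader behaving like [11 22 33]

;; Scalars broadcast against readers/containers the same way:
(dfn/+ 5 a)     ;; => reader behaving like [6 7 8]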
@behrica So it turns out, univocity (which already ships with t.m.d.) has support for selected index parsing as well out of the box, and is already really efficient and able to handle vast swaths of encodings. I just wrapped their line parser and use it to extract entries on demand instead of my toy function. The end result is even better than before, now it's around 2x slower than the dataset implementation for the original smallish file and very narrow lookup case.
I loaded the metadata.csv file you linked from (just the normal 251mb version, without any concatenation to make it multi-gig), and iota appears to handle it fine. tech.ml.dataset won't load the .csv file out of the box, complaining about something inside univocity (probably the quote characters, I'm not sure though).
Try adding this: {:max-chars-per-column 1000000}
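Presumably passed as a parse option when loading, along these lines (the ds alias and the exact option plumbing are assumptions):

(require '[tech.ml.dataset :as ds])

;; Raise univocity's per-column character limit so very long text cells
;; (e.g. whole abstracts) don't abort the parse.
(def metadata (ds/->dataset "metadata.csv" {:max-chars-per-column 1000000}))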
I would just add a callback for progress reporting to it:
(defn univocity-text-columns
  "Reader of lazily-parsed text columns over an iota-backed, memory-mapped file.
   Assumes `->parser` (the univocity line-parser wrapper from earlier in this
   thread), tech.datatype's ObjectReader interface, and a `dtype` alias for the
   tech.datatype namespace are available."
  (^ObjectReader [columns backing]
   (univocity-text-columns "\t" columns backing (fn [idx bound])))
  (^ObjectReader [delimiter columns ^iota.FileVector backing progress-fn]
   (let [col-count (count columns)
         bound     (count backing)
         cursor    (atom {:idx -1})
         parsef    (->parser :delimiter delimiter :indices columns)
         ;; Parse a row at most once per access run; the cursor caches the
         ;; most recently parsed line.
         load-row! (fn [idx]
                     (let [v (.deref ^clojure.lang.Atom cursor)]
                       (if (== ^long (v :idx) ^long idx)
                         v
                         (let [line (nth backing idx)]
                           (progress-fn idx bound)
                           (reset! cursor {:idx  idx
                                           :line line
                                           :row  (parsef line)})))))]
     (reify ObjectReader
       (lsize [rdr] col-count)
       (read [rdr col-idx]
         ;; Each column is itself a lazy reader over the row count.
         (dtype/object-reader
          bound
          (fn [^long idx]
            (let [state (load-row! idx)
                  ^objects row (state :row)]
              (aget row col-idx)))))
       clojure.lang.Indexed
       (nth [this idx]
         (.read this idx))
       (nth [this idx not-found]
         ;; valid column indices are [0, col-count)
         (if (and (not (neg? idx))
                  (< idx col-count))
           (.read this idx)
           not-found))))))
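A usage sketch; the file name, delimiter, and column indices are placeholders:

;; Memory-map the file with iota, then expose csv columns 3 and 8 as lazy readers.
(def backing (iota/vec "metadata.csv"))
(def cols    (univocity-text-columns "," [3 8] backing (fn [_ _])))

;; Nothing is parsed until an element is actually asked for.
(def abstract-col (nth cols 1))         ;; reader over csv column 8
(reduce + (map count abstract-col))     ;; total characters, parsed row by row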
Very impressive.
I did some testing with my 12 GB file, and even with a 1 GB heap I can work with the file very comfortably. The kind of thing I would do would be:
(time
 (reduce +
         (map count
              (first munivocity-text-cols))))
Takes 2 minutes to run with my 12 GB file.
One last benchmark with a 113 GB file, which worked without issues. Taking the text column, which is around 80 GB, I did the word frequency count of all tokens:
(reduce (partial merge-with +)
        (map #(frequencies (clojure.string/split % #" "))
             (remove empty?
                     (first munivocity-text-cols))))
It took 41 minutes using hardly any memory. The overall number of words counted is one billion.
This seems to be faster than using Linux command line tools:
$ time csvcut -c abstract huge.csv | pv -l -s 83911010 | tr " \t" "\n" | sort -u
3.61M 0:12:12 [5.55k/s] [=> ] 4% ETA 4:31:24
ETA: 4 hours
That's moving some big freight :) if frequencies are desirable, it's also trivial to chunk the file and process in parallel.
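A minimal sketch of that chunk-and-parallelize idea in plain Clojure; text-col stands for the lazy text column reader from above, and the chunk size is arbitrary:

(require '[clojure.string :as str])

(defn parallel-word-freqs
  "Word frequencies over a lazy text column, processed in parallel chunks."
  [text-col chunk-size]
  (->> text-col
       (remove empty?)
       (partition-all chunk-size)
       (pmap (fn [chunk]
               (reduce (fn [acc line]
                         (merge-with + acc (frequencies (str/split line #" "))))
                       {}
                       chunk)))
       (reduce (partial merge-with +))))

(parallel-word-freqs (first munivocity-text-cols) 10000)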
Yes, the code was just an example of things I often do.
I am now convinced that the approach works very well, even for the largest files, and the performance is the best I have seen so far with Clojure.
How can we integrate this more deeply into the existing dataset?
If that's broadly useful, maybe encapsulating the concept in a small library, then pulling that in as a dep for t.m.d.? Seems like there are opportunities for reusing that stuff (off heap large file parsing stuff).
Then think through implementing some kind of text column wrapping this stuff.
@behrica very glad to see you are able to get some benefit with your typical workflows with this approach. Thankful that @thebusby posted iota for use in a potentially vastly different context than originally intended.
I cannot see it outside of t.m.d. The current code allows me to "cut" the text columns and work with them in isolation. But that is hardly the usual case.
Normally I need "large text columns" and "numeric columns" of a dataset at the same time, and the full power of all functions in t.m.d.
I would think that "->dataset" could accept a new "source" (apart from raw/gzipped csv/tsv, xls, xlsx, json, and sequences of maps), namely "memory-mapped CSV files", and then do "the right thing", namely picking the correct (configurable) reader for each column. So for large string columns this should be your ObjectReader from "iotatest". If the file is large enough, maybe even all columns should be backed by the memory-mapped file, in order to save memory.
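Purely as an illustration of that proposal, the call could look something like this; none of these option keys exist in t.m.d today except :max-chars-per-column, which appears earlier in this thread:

(require '[tech.ml.dataset :as ds])

(def big (ds/->dataset "huge.csv"
                       {:memory-mapped?       true          ;; hypothetical option
                        :mmap-text-columns    ["abstract"]  ;; hypothetical option
                        :max-chars-per-column 1000000}))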
Yeah, there seems to be a more general precept here: the choice of whether to memory-map fields in the input (either communicated by the caller or inferred as part of some scheme by t.m.d.). The result will be an object reader over the input file, likely a subset of the desired fields. I'm not super strong on design aesthetic, but my sense is that there's probably a clean way to express this and integrate it so it's easy to maintain.
I can think of instances where I would like a subselect of fields parsed from a potentially large CSV/TSV file (typically log parsing leading into some post-processed analytics and visuals). There's some fusion of iota and univocity and tech.datatype (as in the iotatest example) that seems broadly useful in this kind of a context. Instead of mapping clojure.string/split or using clojure.data.csv or whatever, you can get a really efficient parse on top of memory-mapped stuff, and then do whatever you want after the fact (feed to a dataset constructor, do some generic clojure processing on the columns, etc.).
This is a great thread. I think it is really enjoyable to read and is making solid technical progress. I really appreciate this.
I want to rephrase these last few messages to be sure I understand them completely.
Build a memory-mapped, in-place file loader that essentially pre-parses a csv to know where the cells are and whether they need to be escaped or not (with not being the common case), but then stores that data in an in-memory data structure to allow sort-of as-rapid-as-possible random access to the string data in any given cell. Users would perhaps have control of the subrect of cells the parser actually builds the full parsing info on.
On top of this, layer enough parsing to either heuristically infer the column type or let the user specify the column type in the same manner as the existing csv parser. The data could then be lazily parsed only on access, or parsed out in parallelized blocks; many other options exist at that point.
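A rough sketch of the cell-boundary idea for the simple unquoted, unescaped case; every name here is illustrative and not an existing t.m.d API:

;; Pre-parse: record where the cells of a delimited line start, so individual
;; cells can be materialized later without re-scanning the whole line.
(defn cell-offsets [^String line delim]
  (loop [i 0, offsets [0]]
    (if (< i (.length line))
      (recur (inc i)
             (if (= (.charAt line i) (char delim))
               (conj offsets (inc i))
               offsets))
      offsets)))

;; On-demand access: slice cell col out of the line using the offsets.
(defn cell-at [^String line offsets col]
  (let [start (offsets col)
        end   (if (< (inc col) (count offsets))
                (dec (offsets (inc col)))   ;; stop before the delimiter
                (.length line))]
    (subs line start end)))

;; (cell-offsets "id,title,a long abstract..." \,)    ;; => [0 3 9]
;; (cell-at "id,title,a long abstract..." [0 3 9] 1)  ;; => "title"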
I really like this idea, technically. Here are some potential benefits I see:
This would in theory result in the potential to load any size csv file that fits on disk, as long as there is enough memory to store the low-level cell information for whatever subrect of that file you want. It would mean parsing all cells at least twice, but in return it can dramatically lower the memory usage for operating on particular types of datasets. It also fits well within the lazy reader operation ecosystem of tech.datatype, so you really could have a view of the file that was parsed purely on demand.
If so, I do see that as two separate things. Really interesting pathway to this point!
This only saves memory for a very specific type of column, namely "long texts". Doing that for a full CSV file in which most columns are numbers and only a few are texts would not save memory for the numeric columns.
So I think this "intermediate cell boundary storage" should be done for specific columns only. It nevertheless requires the whole file to be memory mapped.
So maybe by default any csv file should be read via iota and memory mapped, but only for "long text" columns do we store the start/end indexes and read strings "late" via the reader as depicted above. All other columns get read as usual and stored in native arrays.
This would have the nice side effect that accessing the numeric columns would have no performance penalty, and only reading the long strings would need to go to the mapped file.
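For illustration, a minimal sketch of the "store only offsets, read the string late" access pattern using plain java.nio memory mapping; the per-column offset bookkeeping is left out and all names are illustrative:

(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode])

;; Map the whole file read-only; no row data is copied onto the heap here.
;; (The file handle is left open for the life of the mapping in this sketch.)
(defn mmap-file [^String path]
  (let [raf (RandomAccessFile. path "r")]
    (.map (.getChannel raf) FileChannel$MapMode/READ_ONLY 0 (.length raf))))

;; Materialize a single "long text" cell, given its byte offset and length,
;; only at the moment it is actually read.
(defn text-at ^String [^java.nio.MappedByteBuffer buf ^long offset ^long len]
  (let [bytes (byte-array len)
        dup   (.duplicate buf)]
    (.position dup (int offset))
    (.get dup bytes)
    (String. bytes "UTF-8")))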
As I was wondering what we would do for non-CSV files, I got another idea.
Instead of memory-mapping the original CSV file, we could parse the CSV as usual, BUT write the long text cells out to a temporary file (or large byte buffers) and read them back lazily from there.
This is similar to storing long strings in huge byte buffers on-heap or off-heap.
This would then work for other file formats as well (for example zipped CSV).
This is again the question of accessing vs. storing long strings.
I like that idea. Parse as usual but store long strings in a temp file that we can mmap and access. That is logically a small change from https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/ml/dataset/text.clj#L140: you could implement a typed byte list like that which stores its data in a file and has a mmap command, or something that uses iota after you finish writing the data. In fact I wonder if iota has a class like that.
@behrica That last bit about parsing into a file works perfectly with iota as is, if you parse the new text file into a format iota already supports (no extra dev work there), then just map the temp file into a file vector. This assumes you have access to making temp files (perhaps not a huge constraint, but may be one for more restrictive systems, just a thought). You probably have the added benefit that this tactic would work with anything (e.g. input streams) and not require having access to the source input file, but a controlled derivative index. You're basically creating your own local off-heap string index/indices (potentially very large), which also opens up many opportunities for compression and storage patterns.
It seems like the expectation is that there are certain classes of inputs where we have columns that necessitate this approach (e.g. some presence of text columns with "large enough" strings that are unique). Are we certain there are not classes of inputs with numeric data, where holding all numeric data in-memory is also untenable? As you stated, that probably expands the scope of the original question (focused on aiding processing large text corpora), but perhaps (I think as @cnuernber alluded to) a general paging/buffering strategy enables arbitrary processing of large datasets with acceptable performance (relative to not being able to process at all).
There's also an interesting question, at what point does latching onto an existing formal storage mechanism (e.g. db) make sense too. Given the ability to memory map stuff, it kind of pushes the line of demarcation for using a more complex, but also more efficient storage backend such as a database.
Relative to #122 , off-heap approaches that retain some reference to a backing file will cause problems for general purpose serialization. Seems like another implicit line of demarcation.
It is going to be painful to get working, but I think, after the time spent on this, that it should be possible to parse any size of CSV file into an Arrow data backing store and operate on that. It may be a parse option to use Arrow as the backing memory store behind the dataset just for the initial parse run. Arrow takes care of batching and such, and tighter integration with Arrow specifically seems like a very solid research pathway. The current integration just copies the data into java vectors, but we can definitely access the data in-place with a bit of research.
The goal of this would be to solve the larger-than-memory dataset problem in its general form by having ->dataset create a dataset that uses Arrow as the backing store, and then having good support for working with Arrow data stores in-place.
The work done on Arrow in 3.0.8 shows the way to go for large text corpora (and other larger-than-RAM files).
This of course went down the "accessing large data" route, and not the "storing large data" route, which was discussed at one point.
For certain big data scenarios, I would still find it interesting if we could tell t.m.d that certain large columns should be "stored" either off-heap or even memory-mapped on disk... but this maybe goes too much into t.m.d becoming a "database".
So I think we can close this issue for now.
I am wondering if the abstractions present in tech.ml and tech.datatype would allow for a specific form of "column type" in tech.ml.dataset which stores larger texts very efficiently (and not as java.lang.String). A bit of background for my use case:
I deal with NLP and a lot of times I have large tables (millions of rows) in which one column is "long text". It could be up to thousands of words per table cell.
This would be far more efficient to store using a vocabulary that maps each distinct token to an integer id, and then storing each cell as an array of those ids instead of a String.
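Roughly along these lines (plain Clojure, purely illustrative):

(require '[clojure.string :as str]
         '[clojure.set :as set])

;; Build a token -> id vocabulary from a collection of texts.
(defn build-vocab [texts]
  (into {}
        (map-indexed (fn [i token] [token i]))
        (distinct (mapcat #(str/split % #" ") texts))))

;; Store a cell as an int array of token ids (assumes every token is in the vocab).
(defn encode [vocab text]
  (int-array (map vocab (str/split text #" "))))

;; Reconstruct the original String on read.
(defn decode [vocab ids]
  (let [id->token (set/map-invert vocab)]
    (str/join " " (map id->token ids))))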
I would imagine that it would drastically reduce the space needed to store a text corpus ... 90% less (of course at the price of "slower" reads and writes to the table).
Ideally this should be "transparently handled" by the "table object".
So reading such a specific "large text" column would reconstruct the String and give me the string back, and writing to it would re-code the strings as arrays of int, possibly enlarging / shrinking the vocabulary.
I would see the "vocabulary" as "attached" to a table, but maybe it should be possible to share the same vocabulary between different tables.
Implementing this well could facilitate the usage of Clojure (and tech.ml.dataset) for various NLP tasks.
My key question here would be whether the existing "abstractions" in the tech.ml universe would allow implementing this "completely transparently".