google-research / dex-lang

Research language for array processing in the Haskell/ML family
BSD 3-Clause "New" or "Revised" License

IO to Tables #449

Closed by srush 3 years ago

srush commented 3 years ago

IO seems to have really come along, but I couldn't really figure it out. I was somehow able to read MNIST digit images stored as bytes in a file with the following code, but I felt really gross writing it.

raw =
    ls = unsafeIO $ \ _. readFile (AsList _ ['e', 'x', 'a', 'm', 'p', 'l', 'e', 's', '/', 'c', 'm', 'n'])
    (AsList _ im) = ls
    unsafeCastTable Full im

dan-zheng commented 3 years ago

Yes, string literals exist and have type String (which is equal to List Char). Does the following work?

raw =
    -- `do` is syntactic sugar for constructing thunks (argument-less lambdas). I quite like the name `do`.
    -- "examples/cmn" has type `String`.
    ls = unsafeIO $ do readFile "examples/cmn"
    -- Yes, this is the only way to destructure `List` values.
    -- I also wish there were something easier.
    -- I tried implementing a `toTable (list:List a) -> n => a` function but it didn't work (well, or at all).
    (AsList _ im) = ls
    unsafeCastTable Full im

dan-zheng commented 3 years ago

I wrote similar data loading code for an in-progress attention model:

' ## Data loading

' Training data [`eng-fra.txt`](https://github.com/L1aoXingyu/seq2seq-translation/blob/master/data/eng-fra.txt) is 9.1 MB. It contains English-French sentence pairs.

inputDataFile = unsafeIO do readFile "eng-fra.txt"

(AsList _ inputData) = inputDataFile

:t inputData

-- Input sizes much greater than 10000 take too long to execute.
inputData' = take inputData 10000

%time
sentencePairs = splitNewline inputData'

Maybe there's some opportunity to create shared data loading functions, like string/byte-manipulation utilities.

srush commented 3 years ago

Oh strange. I could have sworn I tried that, but I must have had a misspelling in the file so it was crashing.

Text data seems really tough, as you have no idea what the length is. Is `take` your function?

dan-zheng commented 3 years ago

> Text data seems really tough, as you have no idea what the length is. Is `take` your function?

Yes. I added some functions for manipulating Haskell-like lists to my in-progress exploration, examples/attention-data.dx:

List operations

Characters

Haskell-like linked list utilities

These have terrible asymptotic performance in Dex because they go against Dex's programming model for arrays/lists.

The primitives have O(n) performance instead of O(1) like in Haskell (due to lazy evaluation). Instead, Dex's arrays should eventually be more like C++ std::vector.

String manipulation


I think linked-list-like operations on Dex arrays are a heavy performance antipattern. But it may be good to document them, for curious users and also for benchmarking purposes - maybe in a file like lib/list-antipatterns.dx.
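
For contrast with the head/tail style, here is a rough sketch of an index-based alternative: one left-to-right pass that records where each line starts and ends, without copying the buffer. It is written in Python rather than Dex, and the name `line_spans` is hypothetical; it is not the `splitNewline` from the example above.

```python
from typing import List, Tuple

def line_spans(data: str) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for each line in `data`.

    A single O(n) pass over the buffer. No prefix or suffix is ever copied,
    unlike repeated head/tail on an array-backed list.
    """
    spans: List[Tuple[int, int]] = []
    start = 0
    for i, ch in enumerate(data):
        if ch == "\n":
            spans.append((start, i))  # end index is exclusive
            start = i + 1
    if start < len(data):
        # Final line without a trailing newline.
        spans.append((start, len(data)))
    return spans

# Each English-French pair can then be sliced out on demand.
text = "go\nva\n"
pairs = [text[s:e] for s, e in line_spans(text)]
```

The spans could be stored as a fixed-size table of offsets, which is closer to the array model than a list of lists of unknown lengths.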

srush commented 3 years ago

I see, yeah, I get what Adam and Dougal are saying.

I would say the first step would be to write a BPE tokenizer? For most modern NLP applications, you don't really need to do much string manipulation. If you can be really efficient in going from List Char -> List Token you are in a really good spot. Perhaps it is worth just writing that as a for loop. I guess you would need to figure out the data structure for matching tokens.
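
To make the "for loop plus a lookup structure" idea concrete, here is a minimal sketch of greedy BPE encoding in Python. It assumes the merge table is already learned and stored as a dictionary from adjacent token pairs to a rank (lower rank merges first). `TOY_MERGES` and `bpe_encode` are made-up names for illustration and are not taken from any of the implementations mentioned in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical toy merge table: (left, right) -> rank; lower rank merges first.
TOY_MERGES: Dict[Tuple[str, str], int] = {
    ("l", "o"): 0,
    ("lo", "w"): 1,
    ("e", "r"): 2,
}

def bpe_encode(text: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    tokens = list(text)  # start from single characters (List Char)
    while len(tokens) > 1:
        # Scan adjacent pairs and pick the one with the lowest rank.
        best_rank, best_i = None, -1
        for i in range(len(tokens) - 1):
            rank = merge_ranks.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_rank is None:
            break  # no known pair left to merge
        pair = (tokens[best_i], tokens[best_i + 1])
        # Merge every occurrence of that pair, left to right.
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

encoded = bpe_encode("lower", TOY_MERGES)
```

The only data structure needed for matching here is the pair-to-rank map. Each round rescans the whole sequence, which keeps the sketch short but is not how an optimized tokenizer would do it.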

dan-zheng commented 3 years ago

That makes sense! @srush: could you please recommend a good BPE tokenizer implementation, in any language (C++ or Python)?

Left to my own devices, I'd personally probably use this Swift byte-pair encoder implementation as a reference implementation for porting to Dex.

srush commented 3 years ago

That Swift one looks great.

There is also:

Rust: https://github.com/huggingface/tokenizers

Python: https://github.com/openai/gpt-2/blob/master/src/encoder.py

Alternatively, all of these are quite complex in a sparse way and might be beyond Dex at the moment. Unless you are looking for a challenge, you could just pretokenize to padded sequences and write the NN lib / transformer to start.
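
As a rough illustration of what "pretokenize to padded sequences" could look like, here is a small Python sketch. It assumes token ids have already been produced by some external tokenizer, and it picks an arbitrary `pad_id` of 0. The function name `pad_sequences` and its parameters are hypothetical.

```python
from typing import List, Tuple

def pad_sequences(
    seqs: List[List[int]], max_len: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate or right-pad each sequence to exactly `max_len` tokens.

    Returns the padded ids and a 0/1 mask marking the real (non-pad) positions.
    """
    padded: List[List[int]] = []
    mask: List[List[int]] = []
    for seq in seqs:
        kept = seq[:max_len]  # truncate long sequences
        n_pad = max_len - len(kept)
        padded.append(kept + [pad_id] * n_pad)
        mask.append([1] * len(kept) + [0] * n_pad)
    return padded, mask

# Hypothetical token ids for two sentences of different lengths.
ids, mask = pad_sequences([[5, 6, 7], [8]], max_len=2)
```

Every row then has the same length, so the whole batch fits a rectangular table with a statically known size, and the mask lets the model ignore the padding.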

srush commented 3 years ago

Just to make that last point tangible, fully porting over this small model would be really neat: https://github.com/huggingface/transformers/blob/master/src/transformers/models/distilbert/modeling_distilbert.py

Even besides BPE, there are still lots of open questions about loading model params, layers, optimizers, and removing padding.