Closed srush closed 3 years ago
Yes, string literals exist and have type `String` (which is equal to `List Char`). Does the following work?
```
raw =
  -- `do` is syntactic sugar for constructing thunks (argument-less lambdas). I quite like the name `do`.
  -- "examples/cmn" has type `String`.
  ls = unsafeIO $ do readFile "examples/cmn"
  -- Yes, this is the only way to destructure `List` values.
  -- I also wish there were something easier.
  -- I tried implementing a `toTable (list:List a) -> n => a` function but it didn't work (well, or at all).
  (AsList _ im) = ls
  unsafeCastTable Full im
```
I wrote similar data loading code for an in-progress attention model:
```
' ## Data loading
' Training data [`eng-fra.txt`](https://github.com/L1aoXingyu/seq2seq-translation/blob/master/data/eng-fra.txt) is 9.1 MB. It contains English-French sentence pairs.

inputDataFile = unsafeIO do readFile "eng-fra.txt"
(AsList _ inputData) = inputDataFile
:t inputData

-- Input sizes much greater than 10000 take too long to execute.
inputData' = take inputData 10000

%time
sentencePairs = splitNewline inputData'
```
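For readers without a Dex toolchain, here is a rough Python analogue of that loading cell. Only the filename and the 10000-character truncation come from the snippet; the function name and everything else are illustrative.

```python
# Read the raw file, truncate to the first `limit` characters (mirroring
# `take inputData 10000` above), and split into lines like `splitNewline`.
def load_sentence_pairs(path, limit=10000):
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    return data[:limit].split("\n")
```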
Maybe there's some opportunity to create shared data loading functions, like string/byte-manipulation utilities.
Oh strange. I could have sworn I tried that, but I must have had a misspelling in the file so it was crashing.
Text data seems really tough, as you have no idea what the length is. Is `take` your function?
Yes. I added some functions for manipulating Haskell-like lists to my `examples/attention-data.dx` in-progress exploration:
List operations:

```
def listTable ((AsList n xs): List a) : (Fin n) => a = xs
def filter (f:a->Bool) (list: List a) : List a
def mapList (f:a->{|eff} b) (list:List a) : {|eff} (List b)
whitespaceChar : Char
tabChar : Char
newlineChar : Char
isSpace (c: Char) : Bool
```

`'a'` character literal syntax may be nice. I think it requires some design decisions about handling Unicode codepoints. Swift 5.3 still doesn't have character literal syntax; the community hasn't committed to a design decision.

These have terrible asymptotic performance in Dex because they go against Dex's programming model for arrays/lists.
The primitives have O(n) performance instead of O(1) like in Haskell (due to lazy evaluation and linked lists). Instead, Dex's arrays should eventually be more like C++ `std::vector`.
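To make the asymptotics concrete, here is an illustrative Python sketch (not Dex or Haskell) contrasting the two representations: prepending to a flat array forces a full copy, while a linked list shares its tail.

```python
# Why `cons` is O(n) on a flat array: prepending requires copying every
# existing element into a fresh buffer. A linked list just allocates one
# new cell pointing at the old head, like Haskell's (:).

def cons_array(x, xs):
    # O(n): allocates a new list and copies all of xs.
    return [x] + xs

def cons_linked(x, xs):
    # O(1): a (head, tail) pair sharing the old list unchanged.
    return (x, xs)
```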
```
def last (xs:n=>a) : a
def take (n:Type) ?-> (xs:n=>a) (count:Int) : List a      -- take the first `count` elements
def drop (n:Type) ?-> (xs:n=>a) (count:Int) : List a      -- drop the first `count` elements
def dropLast (n:Type) ?-> (xs:n=>a) (count:Int) : List a  -- drop the last `count` elements
def cons (x:a) (xs:List a) : List a                       -- prepend element
def snoc (xs:List a) (x:a) : List a                       -- append element
def takeWhile (f:a->Bool) (list:List a) : List a
def dropWhile (f:a->Bool) (list:List a) : List a
def span (f:a->Bool) (list:List a) : (List a & List a)    -- utility function from Haskell
def break (f:a->Bool) (list:List a) : (List a & List a)   -- utility function from Haskell
def split (s:String) (delimiter:Char) : List String       -- (`List`-returning functions, after talking with Adam and Dougal)
def splitNewline (s:String) : List String
```
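For readers unfamiliar with the Haskell originals, here is a hypothetical Python rendering of the semantics of a few of these signatures. The Dex versions operate on `List a`; the edge-case behavior here (e.g. how `split` handles delimiters) is my assumption, not taken from the file.

```python
# Reference semantics, written against plain Python lists.

def take_while(f, xs):
    # Longest prefix of xs whose elements all satisfy f.
    out = []
    for x in xs:
        if not f(x):
            break
        out.append(x)
    return out

def span(f, xs):
    # span f xs == (takeWhile f xs, dropWhile f xs)
    prefix = take_while(f, xs)
    return prefix, xs[len(prefix):]

def break_(f, xs):
    # break f xs == span (not . f) xs
    return span(lambda x: not f(x), xs)

def split(s, delimiter):
    # Cut a list of characters at every occurrence of the delimiter.
    parts, cur = [], []
    for c in s:
        if c == delimiter:
            parts.append(cur)
            cur = []
        else:
            cur.append(c)
    parts.append(cur)
    return parts
```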
I think linked-list-like operations on Dex arrays are a heavy performance antipattern. But it may be good to document them, for curious users and also for benchmarking purposes - maybe in a file like `lib/list-antipatterns.dx`.
I see, yeah I get what Adam and Dougal are saying.
I would say the first step would be to write a BPE tokenizer. For most modern NLP applications, you don't really need to do much string manipulation. If you can be really efficient in going from `List Char -> List Token`, you are in a really good spot. Perhaps it is worth just writing that as a for loop. I guess you would need to figure out the data structure for matching tokens.
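To sketch the BPE idea in the simplest possible terms: greedily replace the most frequent adjacent token pair with a merged token, repeating for a fixed number of merges. This Python toy learns its merges from the input text itself; a real tokenizer (GPT-2 style) ships a pretrained merge list and vocabulary, and this is not anyone's production implementation.

```python
from collections import Counter

def bpe_train(text, num_merges):
    # Start from individual characters.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs and pick the most frequent one.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```

The interesting part for Dex is exactly what @srush raises: the merge scan is a natural for loop, but the output length is data-dependent, which is where the `List`-vs-table tension shows up again.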
That makes sense! @srush: could you please recommend a good BPE tokenizer implementation, in any language (C++ or Python)?
Left to my own devices, I'd personally probably use this Swift byte-pair encoder implementation as a reference implementation for porting to Dex.
That Swift one looks great.
There is also:
- Rust: https://github.com/huggingface/tokenizers
- Python: https://github.com/openai/gpt-2/blob/master/src/encoder.py
Alternatively, all of these are quite complex in a sparse way and might be beyond Dex at the moment. Unless you are looking for a challenge, you could just pretokenize to padded sequences and write the NN lib / transformer to start.
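The pretokenize-to-padded-sequences route could look something like this Python sketch: tokenize offline, pad every sentence to a fixed length, and hand Dex a dense rectangular table it can index directly. The whitespace tokenizer, `PAD_ID`, and the function name are all placeholders, not anything from this thread.

```python
PAD_ID = 0  # hypothetical padding token id

def pad_batch(sentences, vocab, max_len):
    # Map words to ids, truncate to max_len, and right-pad with PAD_ID so
    # every row has the same length (a rectangular (batch, max_len) table).
    batch = []
    for s in sentences:
        ids = [vocab[w] for w in s.split()][:max_len]
        ids += [PAD_ID] * (max_len - len(ids))
        batch.append(ids)
    return batch
```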
Just to make that last point tangible. Fully porting over this small model would be really neat: https://github.com/huggingface/transformers/blob/master/src/transformers/models/distilbert/modeling_distilbert.py
Even besides BPE, there are still lots of open questions about loading model params, layers, optimizers, and removing padding.
IO seems to have really come along, but I couldn't really figure it out. Somehow I was able to read MNIST image digits stored as bytes in a file with the following code, but I felt really gross writing it.