google-research / dex-lang

Research language for array processing in the Haskell/ML family
BSD 3-Clause "New" or "Revised" License
1.56k stars 106 forks source link

adding support for sparse matrices/arrays? #1310

Open cartazio opened 1 year ago

cartazio commented 1 year ago

I spoke with @axch (hey Alexey!) at POPL 2023 about one particular approach for augmented support for sparse and various structured matrix/array formats, though it does require weakening the mapping between coordinates and values in a simple but possibly reasonable way. I have a rough haskell analogue that show cases the ideas in this repo here

but the core idea is roughly as follows

--- Index is a size indexed list of integers
class Layout format where 
     type AddressRep format
     type Rank format
     compareIx :: format -> Index (Rank) -> Index (Rank) -> ORD
     --- every address layout *must* have an ordering of the indices
     ---  that respects the associated address of values layout ordering, 
     ----this lets you write good locality code without knowing the memory layout!
     --- ie if ix1 and ix2 are in bounds, they must be totally ordererable, 
     ---- even if the coordinates are invalid and do not map to concrete addresses
     compareAddr :: (addr ~ AddressRep format) => format -> addr -> addr -> ORD
     --- compareIx and compareAddr should agree on point that are valid in both index and address space
     addressRange ::(addr ~ AddressRep format) => format -> Maybe (addr,addr)
     --- returns the first and last valid addresses for a given format datum. should be O(rank) or ideally O(1)
     address2Index :: (addr ~ AddressRep format) => format -> addr -> Index (Rank format)
     -- address2index should always succeed for valid addreses, 
     -- and should have complexity O(rank format). eg O(2) for matrices :). 
     --- Constant time for fixed choice in layout format 
     addressPopCount :: (addr ~ AddressRep format) =>form -> (addr,addr) -> Word64
     --- addressPopCount form (addr1,addr2) counts the number of valid addresses 
     --- inclusive of the min and max points as determined by the tuple (addr1,addr2)
     nextAddress :: (addr ~ AddressRep format) => format -> addr -> Maybe addr
     -- nextaddress should also have complexity O(rank format), 
     --this can be accomplished by having the associated address datatype carry a little bit of extra information 
     --- for pretty much every format i've come across
     affineShift ::  (addr ~ AddressRep format) => format -> addr -> Int64 -> Maybe addr
     --- `affineShift format address shift`should strive to have 
     ---complexity O(log min(|shift|, pop count of format range )) or better
     --- achieving this complexity bound may require extra 
     --- precomputed metadata in the datatype definition of the Format `form`
     index2address :: (addr ~ AddressRep format) => format -> Index (Rank format) -> Maybe addr 
     index2address form ix =  case findIndex form ix of Nothing -> Nothing ;
                                               Just (ix2,addr) -> if ix2==ix then Just addr else Nothing 
     findIndex :: (addr ~ AddressRep format) => format -> Index (Rank format) -> Maybe (Index (Rank format), addr)
     -- findindex form ix, finds the first address in the layout format that corresponds with an index equal to or greater  than ix
     --- if `ix` is a valid coordinate, findindex form ix will return Just (ix,addr_of_ix)
     --- as determined by the ordering relation `compareIx` (equivalently the  ordering of increasing addresses )
     --- a further optimized version would take an extra `Maybe addr` argument as a hint 
     ---for where to start the search for the right index,address pair result tuple.

this is a condensed version of the core ideas in the above link, though a bit more modernized than the linked code (which doesn't quite match this at the moment). One fun thing to notice is that this api is just strong enough to express ideas like the Leap frog trie join algorithm (tis but a very very fancy sparse dot product) while not making strong assumptions about the particular memory layout/formats.

I'm happy to explain these ideas further. My understanding is that dexlang currently assumes roughly index2address :: index -> address rather than the above index2address :: index -> Maybe address api. Which probably has a lot of knock on consequences in the current system.