ianand / spreadsheets-are-all-you-need


Pure Browser Implementation #2

Open ianand opened 5 months ago

ianand commented 5 months ago

Implement an in-browser spreadsheet to run GPT2. Would be great to have a pure browser implementation while maintaining the spreadsheet interface.

Why:

Additional notes:

ianand commented 5 months ago

Another issue with Excel: For an enhanced version of the Embeddings lesson plan, I wanted to use SVD on a co-occurrence matrix to demonstrate primitive embeddings. But Excel can't run SVD.
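For readers unfamiliar with the idea, here is a minimal sketch of "SVD on a co-occurrence matrix" as a source of primitive embeddings. It is not the lesson plan's actual material: the toy sentences, the window size, and all function names are assumptions made for illustration, and the SVD is approximated with plain power iteration plus deflation so that the example needs no libraries.

```python
import math

def cooccurrence(sentences, window=1):
    """Count how often each word pair appears within `window` positions."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in vocab]
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w]][index[words[j]]] += 1.0
    return vocab, counts

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def top_singular_triplet(a, iters=500):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    at = [list(col) for col in zip(*a)]
    v = [1.0 / (i + 1) for i in range(len(a[0]))]  # deterministic start vector
    for _ in range(iters):
        w = matvec(at, matvec(a, v))
        nw = norm(w)
        if nw == 0.0:
            break
        v = [x / nw for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av] if sigma > 0 else av
    return sigma, u, v

def truncated_svd(a, k):
    """Top-k singular triplets, found one at a time with rank-1 deflation."""
    a = [row[:] for row in a]  # work on a copy
    triplets = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(a)
        triplets.append((sigma, u, v))
        for i in range(len(a)):
            for j in range(len(a[0])):
                a[i][j] -= sigma * u[i] * v[j]
    return triplets

def embeddings(sentences, k=2):
    """Map each word to a k-dimensional vector (rows of U scaled by sigma)."""
    vocab, counts = cooccurrence(sentences)
    triplets = truncated_svd(counts, k)
    return {w: [sigma * u[i] for sigma, u, _ in triplets]
            for i, w in enumerate(vocab)}

# Toy corpus (hypothetical): "cat" and "dog" occur in identical contexts.
vectors = embeddings(["the cat sat", "the dog sat"], k=2)
print(vectors["cat"], vectors["dog"])
```

In this toy corpus "cat" and "dog" have identical co-occurrence rows, so they end up with the same vector, which is the property such a demonstration is meant to show.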

will-ca commented 5 months ago

Self-host OnlyOffice or similar?

https://www.onlyoffice.com/spreadsheet-editor.aspx

Should be Excel-compatible.

…Or import to Google Sheets?

ianand commented 5 months ago

Google Sheets won't work: the workbook is too big to fit. I tried multiple ways, even using clasp. In the meantime, OpenOffice Calc might be an alternative for people who don't have Excel, but I haven't tried it yet.

nhatcher-frequenz commented 5 months ago

Hi @ianand, if you think IronCalc mentioned above could be what you are looking for I would be happy to prioritize work to make it happen or tell you if there are major roadblocks.

ianand commented 5 months ago

@nhatcher-frequenz thanks for reaching out. IronCalc looks cool and promising. How can I help?

It doesn't look like the version in the playground can read/write Excel files yet?

nhatcher commented 5 months ago

Hi @ianand, indeed you cannot upload an Excel workbook in the playground. Our best chance right now is to use a tailored script or use a TUI like https://github.com/ironcalc/TironCalc.

I think the next steps are for me to identify what is possible and what is not. I will get back to you in the next couple of weeks.

ianand commented 5 months ago

Sounds good. Let me know how it goes.

will-ca commented 5 months ago

@nhatcher IronCalc says it uses Rust, and WASM for web. Per my understanding, WASM has a hard cap at 4GB of RAM due to being a 32-bit format, and can even run into problems long before due to issues with freeing, address space, and fragmentation. JS reportedly has similar usage limits, visible in the Chromium console or Firefox about:config.

The .xlsb download on this repository is over 1GB. Presumably, loading and evaluating it takes several times that. Do you have a solution to this possible problem?
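The arithmetic behind the 4GB concern can be sketched as follows. The 32-bit address width and the 64 KiB WebAssembly page size are standard; the `blowup` factor for how much larger the in-memory workbook is than the file on disk is a purely hypothetical number, since the thread only says "several times".

```python
GIB = 2 ** 30
WASM_PAGE_BYTES = 64 * 1024   # WebAssembly linear memory grows in 64 KiB pages
WASM32_ADDRESS_BITS = 32      # wasm32 uses 32-bit memory addresses

def wasm32_max_bytes():
    """Largest linear memory addressable with 32-bit addresses."""
    return 2 ** WASM32_ADDRESS_BITS

def wasm32_max_pages():
    """The same limit expressed in 64 KiB pages."""
    return wasm32_max_bytes() // WASM_PAGE_BYTES

def estimated_fits(file_bytes, blowup):
    """True if file size times an assumed in-memory blow-up stays under the cap."""
    return file_bytes * blowup <= wasm32_max_bytes()

print(wasm32_max_bytes() / GIB)    # 4.0
print(wasm32_max_pages())          # 65536
# Hypothetical: a 1 GiB file that needs 5x its size once loaded and evaluated.
print(estimated_fits(1 * GIB, 5))  # False
```

This is only an upper bound; as noted above, fragmentation and address-space issues can cause failures well before the cap is reached.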

nhatcher commented 5 months ago

Hi @will-ca, my own experiments seem to confirm what you are saying. When I run the model on bare metal, my OS reports ~12 GB. I have not given up just yet, but running in the browser seems tough.

ianand commented 5 months ago

@will-ca thanks for the heads up and @nhatcher thanks for smoke testing.

There is evidently a wasm64 spec that's still experimental as of now but can be enabled via Chrome flags: https://github.com/WebAssembly/memory64/issues/36

ianand commented 5 months ago

Don’t know why I didn’t think of this earlier, since I considered this approach for Google Sheets a while back. The model is very modular, so it should be easy to split into multiple separate wasm instances (in separate workers? Or separate tabs?) that should each fit into 4 GB. Feeling like this should be surmountable. A first test in IronCalc would be to extract just one of the layer tabs from the Excel sheet, along with its associated weight matrices, and check whether we can get a single layer of the transformer to run. That’s the first MVP/PoC. But we’ll need those compatible formulas. @nhatcher thoughts?
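A rough sketch of the splitting idea: group consecutive layers so that each group stays under a per-instance memory budget. The per-layer sizes here are hypothetical (the ~12 GB reported earlier in the thread spread evenly over 12 blocks), as is the 3 GiB budget chosen to leave headroom under the 4 GB cap; `partition_layers` is an illustrative name, not part of any existing tool.

```python
GIB = 2 ** 30

def partition_layers(layer_bytes, budget_bytes):
    """Greedily pack consecutive layers into groups whose total size fits the budget.

    Returns a list of groups, each a list of layer indices.
    """
    groups, current, used = [], [], 0
    for idx, size in enumerate(layer_bytes):
        if size > budget_bytes:
            raise ValueError(f"layer {idx} alone exceeds the budget")
        if used + size > budget_bytes:
            # Current group is full: close it and start a new one.
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        groups.append(current)
    return groups

# Hypothetical sizes: 12 blocks of 1 GiB each, 3 GiB budget per wasm instance.
layers = [1 * GIB] * 12
print(partition_layers(layers, 3 * GIB))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

Keeping layers consecutive means each instance only has to pass its output activations on to the next one, which suits a model whose blocks run strictly in sequence.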

nhatcher commented 5 months ago

Hey @ianand, that might work actually. If you are able to split the workbook into 4 different ones using external references (like ='[Workbook2]Sheet3'!D3) then we might have a chance. But we are not there yet: the required formulas might take months to implement. Some formulas are just a few hours' work, but the dynamic arrays and the Lambdas will need some time, and by then the wasm64 ecosystem might be in place. Also, it is not clear to me that the workbook will compute in a reasonable amount of time. I thought I could mock some of those functions and get a rough idea of how long the computation would take, but I can't do it with confidence. I think I have a couple of tricks up my sleeve, and I might get back to you in a couple of months with some realistic data, once those functions are implemented.
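To make the external-reference form concrete, here is a small sketch that builds and parses formulas shaped exactly like the example above. The helper names are hypothetical, the pattern only covers single-cell references in that one quoted form, and nothing here says whether IronCalc currently evaluates such references.

```python
import re

# Matches the form ='[Workbook]Sheet'!Cell from the example above (single cells only).
EXTERNAL_REF = re.compile(
    r"^='\[(?P<workbook>[^\]]+)\](?P<sheet>[^']+)'!(?P<cell>\$?[A-Z]+\$?[0-9]+)$"
)

def external_ref(workbook, sheet, cell):
    """Build a formula that points at a cell in another workbook."""
    return f"='[{workbook}]{sheet}'!{cell}"

def parse_external_ref(formula):
    """Split an external-reference formula into (workbook, sheet, cell)."""
    match = EXTERNAL_REF.match(formula)
    if match is None:
        raise ValueError(f"not a single-cell external reference: {formula!r}")
    return match.group("workbook"), match.group("sheet"), match.group("cell")

formula = external_ref("Workbook2", "Sheet3", "D3")
print(formula)                      # ='[Workbook2]Sheet3'!D3
print(parse_external_ref(formula))  # ('Workbook2', 'Sheet3', 'D3')
```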

ianand commented 5 months ago

If you are able to split the workbook into 4 different ones using external references (like ='[Workbook2]Sheet3'!D3) then we might have a chance.

That would be very easy to do. As an aside, maybe an interesting variant would be to have these separate tabs running on separate machines.

But we are not there yet: the required formulas might take months to implement. Some formulas are just a few hours' work, but the dynamic arrays and the Lambdas will need some time, and by then the wasm64 ecosystem might be in place. Also, it is not clear to me that the workbook will compute in a reasonable amount of time. I thought I could mock some of those functions and get a rough idea of how long the computation would take, but I can't do it with confidence.

I think I have a couple of tricks up my sleeve, and I might get back to you in a couple of months with some realistic data, once those functions are implemented.

Good points @nhatcher. I wonder if I can make the job easier by "meeting in the middle", i.e. I modify the spreadsheet implementation to more closely match what's currently available in IronCalc. Specifically, I could try re-implementing a single layer of GPT2 (i.e. the Block_0 tab in the current sheet) without using Lambdas, etc. Not sure how hard that is though.

Can I assume https://github.com/ironcalc/IronCalc/blob/e9fc41541b6e60d66430db68802cf9bdecf378c0/base/src/functions/mod.rs#L70 is the list of currently implemented functions?

BTW Does IronCalc support R1C1 style references?

nhatcher commented 5 months ago

Can I assume https://github.com/ironcalc/IronCalc/blob/e9fc41541b6e60d66430db68802cf9bdecf378c0/base/src/functions/mod.rs#L70 is the list of currently implemented functions?

Yes, the list of supported functions will also be at the wiki: https://github.com/ironcalc/IronCalc/wiki/

BTW Does IronCalc support R1C1 style references?

IronCalc does all its computations with the R1C1 style internally. The A1 style is only used for display.
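For anyone unfamiliar with the two notations, the sketch below converts an A1-style reference into R1C1 form, both absolute and relative to a base cell. It only illustrates the notation as Excel writes it. It is not IronCalc's internal representation, and the function names are made up for the example.

```python
import re

A1_PATTERN = re.compile(r"\$?([A-Za-z]+)\$?([0-9]+)")

def column_to_number(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number

def a1_to_row_col(ref):
    """Parse an A1-style reference such as 'D3' into (row, column) numbers."""
    match = A1_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    return int(match.group(2)), column_to_number(match.group(1))

def a1_to_r1c1(ref):
    """Absolute R1C1 form: 'D3' -> 'R3C4'."""
    row, col = a1_to_row_col(ref)
    return f"R{row}C{col}"

def a1_to_relative_r1c1(ref, base):
    """R1C1 form relative to the cell `base` that holds the formula."""
    row, col = a1_to_row_col(ref)
    base_row, base_col = a1_to_row_col(base)

    def part(letter, offset):
        # A zero offset is written as the bare letter, e.g. 'R' or 'C'.
        return letter if offset == 0 else f"{letter}[{offset}]"

    return part("R", row - base_row) + part("C", col - base_col)

print(a1_to_r1c1("D3"))                 # R3C4
print(a1_to_r1c1("AA10"))               # R10C27
print(a1_to_relative_r1c1("D3", "B2"))  # R[1]C[2]
```

The relative form is what makes R1C1 convenient for computation: a formula filled down a column has the same R1C1 text in every row.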

Good points @nhatcher. I wonder if I can make the job easier by "meeting in the middle", i.e. I modify the spreadsheet implementation to more closely match what's currently available in IronCalc. Specifically, I could try re-implementing a single layer of GPT2 (i.e. the Block_0 tab in the current sheet) without using Lambdas, etc. Not sure how hard that is though.

I think we can have this conversation once IronCalc is a bit more developed. Once we hit MVP and have a page where you can try your formulas more easily by uploading and downloading Excel workbooks, I could ask you to simplify your model a bit.

A couple of things I have learned this week. I managed to compile it to wasm64, and it doesn't seem the model will ever run in the browser. A 12 GB webpage seems to be too much even for modern browsers.

But IronCalc will open it, make changes and evaluate it just fine. You will either need to use a not-yet-developed desktop version or the TUI TironCalc. That by itself might be useful for your purposes: millions of people don't have access to an Excel license. The only question mark is how long IronCalc would take to evaluate the model once the formulas are implemented. It might be more time than you are comfortable with (over 5 minutes, maybe more?).

ianand commented 5 months ago

Yes, the list of supported functions will also be at the wiki: https://github.com/ironcalc/IronCalc/wiki/

Thanks.

IronCalc does all its computations with the R1C1 style internally. The A1 style is only used for display.

Great.

I think we can have this conversation once IronCalc is a bit more developed. Once we hit MVP and have a page where you can try your formulas more easily by uploading and downloading Excel workbooks, I could ask you to simplify your model a bit.

Ok.

A couple of things I have learned this week. I managed to compile it to wasm64, and it doesn't seem the model will ever run in the browser. A 12 GB webpage seems to be too much even for modern browsers.

Thanks for trying. What is the failure mode? Very slow? Out of memory? I still wonder if splitting across layers might help.

But IronCalc will open it, make changes and evaluate it just fine. You will either need to use a not-yet-developed desktop version or the TUI TironCalc. That by itself might be useful for your purposes: millions of people don't have access to an Excel license. The only question mark is how long IronCalc would take to evaluate the model once the formulas are implemented. It might be more time than you are comfortable with (over 5 minutes, maybe more?).

Good to know about the TUI. It's not accessible in a browser, but that's better than nothing, especially for those without an Excel license, as you point out.

I'm looking forward to when you think we can do an MVP with a single layer of the model.

nhatcher commented 4 months ago

Thanks for trying. What is the failure mode? Very slow? Out of memory? I still wonder if splitting across layers might help.

It's eventually killed on my laptop by the OOM killer. But it is difficult to say at this point whether the problem is solvable: it could be an error on the wasm64 side, or it could be that we just can't parse that data structure in a browser. Another difficulty is that I have to "mock" parts of the workbook to be able to parse it without error. I think it is essentially correct, but there are many "what-ifs". At this point, our best chance is to get IronCalc up to speed. Anyway, as soon as I get some indication that simplifying the workbook somehow, or some other solution, might work, I will get back to you.