In Firefox, the memory footprint of the inference engine is quite large when running in Wasm, and there aren't good memory tools for analyzing Wasm. Instead, we should compile the engine natively and analyze its memory there.
Some details, copied from another document:
This worker has a copy of the models (~20–50 MB) and a copy of the engine binary (a few MB). However, when the engine is running, memory balloons to ~250 MB of RSS and ~450 MB of reserved Wasm heap. It's unclear without further analysis where exactly this memory comes from, but my assumption is that the model gets copied into Marian's ExpressionGraph class. Marian has its own allocator, called a Workspace, and the tensors are allocated there.
The work here would be to integrate DHAT (Valgrind's dynamic heap analysis tool) or some other memory tool into a native build of marian-dev, and run one of our quantized models under it. This should tell us the call sites where memory is being allocated. We can also enlist other Firefox platform experts to help analyze the results once we have something working through a taskfile command.