New model addition: MarkupLM

pogzyb commented 6 months ago

Model description

MarkupLM is a BERT-like model applied to HTML pages instead of raw text documents. There could be a lot of interesting uses for this type of model in the browser.

Prerequisites

[X] The model is supported in Transformers (i.e., listed here)
[X] The model can be exported to ONNX with Optimum (i.e., listed here)

Additional information

I think the most difficult part of the implementation will be MarkupLM's preprocessing. Specifically, MarkupLM uses a combination of a "feature extractor" and a "tokenizer". The "feature extractor" extracts nodes and xpaths from HTML strings. These nodes and xpaths are then fed to the "tokenizer" to produce xpath tag and subscript sequences. The Python implementation uses BeautifulSoup, so the JavaScript implementation might need a third-party HTML parsing library if DOMParser doesn't cut it.
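
For anyone picking this up, here is a rough sketch of what the "feature extractor" step produces. It is not the Transformers implementation: it uses Python's built-in html.parser instead of BeautifulSoup, and the names Node, TreeBuilder and extract_nodes_and_xpaths are made up for illustration. The subscript rule (no index when a tag has no same-named siblings, otherwise a 1-based index) is my reading of the Python feature extractor, so treat it as an assumption.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal element node (illustration only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a small element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def xpath_step(node):
    """Return 'tag' or 'tag[i]' depending on same-named siblings (assumed rule)."""
    same = [c for c in node.parent.children
            if isinstance(c, Node) and c.tag == node.tag]
    if len(same) == 1:
        return node.tag
    index = next(i for i, s in enumerate(same, 1) if s is node)
    return f"{node.tag}[{index}]"


def extract_nodes_and_xpaths(html):
    """Return parallel lists: text nodes and the xpath of each node's parent element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes, xpaths = [], []

    def walk(node, path):
        for child in node.children:
            if isinstance(child, str):
                nodes.append(child)
                xpaths.append(path or "/")
            else:
                walk(child, f"{path}/{xpath_step(child)}")

    walk(builder.root, "")
    return nodes, xpaths


page = "<html><body><h1>Title</h1><ul><li>one</li><li>two</li></ul></body></html>"
nodes, xpaths = extract_nodes_and_xpaths(page)
for text, xpath in zip(nodes, xpaths):
    print(xpath, "->", text)
```

A JavaScript port would presumably do the same walk over the tree that DOMParser returns, so no third-party parser should be needed for this step unless DOMParser's handling of malformed HTML differs in ways that matter.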

In short, the model needs two xpath inputs ('xpath_tags_seq' and 'xpath_subs_seq') in addition to the usual 'input_ids', 'token_type_ids', and 'attention_mask'.
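
To make the two extra inputs concrete, here is a minimal sketch of turning one xpath string into a padded tag-id sequence and a padded subscript sequence. Everything named here is an assumption for illustration: TAG_VOCAB is a toy vocabulary, MAX_DEPTH is deliberately tiny, and the pad/unknown id layout only imitates how I understand the Python tokenizer to work. As far as I can tell, the real tokenizer also repeats a node's sequences for each token produced from that node, which this sketch skips.

```python
import re

# All values below are hypothetical, chosen to keep the example readable.
MAX_DEPTH = 6                      # number of xpath steps kept per node
TAG_VOCAB = {"html": 0, "body": 1, "div": 2, "li": 3, "ul": 4}
UNK_TAG_ID = len(TAG_VOCAB)        # id for tags missing from the vocabulary
PAD_TAG_ID = UNK_TAG_ID + 1        # id used to pad short xpaths
MAX_WIDTH = 1000                   # subscripts are clipped to this value
PAD_SUB_ID = MAX_WIDTH + 1         # subscript id used for padding

# Matches one xpath step such as "li" or "li[2]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def xpath_to_sequences(xpath):
    """Convert '/html/body/ul/li[2]' into (tag ids, subscripts), both padded to MAX_DEPTH."""
    tags, subs = [], []
    for part in xpath.strip("/").split("/"):
        if not part:
            continue
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported xpath step: {part!r}")
        name, index = match.group(1), match.group(2)
        tags.append(TAG_VOCAB.get(name, UNK_TAG_ID))
        # A step without an index is encoded as subscript 0.
        subs.append(min(int(index), MAX_WIDTH) if index else 0)
    # Truncate deep paths, then pad shallow ones.
    tags, subs = tags[:MAX_DEPTH], subs[:MAX_DEPTH]
    padding = MAX_DEPTH - len(tags)
    return tags + [PAD_TAG_ID] * padding, subs + [PAD_SUB_ID] * padding


tags_seq, subs_seq = xpath_to_sequences("/html/body/ul/li[2]")
print(tags_seq)  # [0, 1, 4, 3, 6, 6]
print(subs_seq)  # [0, 0, 0, 2, 1001, 1001]
```

Since this part is plain string handling and array padding, it should port to JavaScript directly. The values that must match the Python side exactly are the tag vocabulary and the pad/unknown ids, which would need to come from the model's tokenizer config rather than from this sketch.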

Your contribution

I added huggingface/optimum#1784 in optimum, but I'm not much of a JavaScript developer. I'd be happy to try either implementing the preprocessing or the pipeline, but I would need some guidance/regular reviews.

xenova commented 6 months ago

Hi there! 👋 This does sound pretty interesting! I would imagine the built-in document parser should be sufficient. I'd be happy to review if you (or another community member) would like to open a PR!

jonathanpv commented 6 months ago

It's not entirely obvious to me what this model does from the Hugging Face docs, but if it's able to produce a mapping like

xpath -> goal / feature

then we can support local agentic solutions, or at least writing out instructions. Running them would be a different story.

For example:

xpath /some/div/here -> this div is a button that will submit an order
xpath /some/other/div/here -> this div handles file uploading

benefit: local-first private solution, could be a quick "accessibility vibe check": if an AI can figure it out, your user can too?

That's just one app idea, here's another:

browser LLM-powered site cloner:

/some/div/here -> this div is a button that will submit an order
/some/other/div/here -> this div handles file uploading

then pass those divs as CSS selectors or xpath selectors(?) to pick a div to clone or translate into Next.js components, using GPT-4 or some Hugging Face model that excels at front-end code

benefit: token efficiency vs cloud solution, local-first approach

here are the app ideas I have that may leverage this (unsure tbh what the model does out of the box):

chat with html
html to selenium code, e.g. "given this html, write selenium to book a flight"
html ui cloner app (outlined above)

curious what others think

pogzyb commented 6 months ago

@xenova - sounds good! I'll try to take a crack at it, and if any community members would like to help or offer their advice, that'd be appreciated.

@jonathanpv - my main focus with the model has been to fine-tune it for cybersecurity related tasks. Here's a first draft of a fine-tuned model I trained: pogzyb/markuplm-phish. From my experience, a fine-tuned MarkupLM performed better than a fine-tuned BERT on phish/malicious website classification. The final goal of my project is to create a browser extension with the added benefit that the user's data stays local to their machine like you pointed out.

I think the html to selenium code generation is a good one! Another idea I was thinking about was automatic page re-orientation, like how browsers offer "reader mode" for some websites. The model/app could optimize user experience on "clunky" pages (move text around, resize images, summarize or hide irrelevant sections/nodes/paragraphs). Even tutorial use-cases similar to what "WalkMe" does could be supported.

jonathanpv commented 6 months ago

Here's a first draft of a fine-tuned model I trained: pogzyb/markuplm-phish. From my experience, a fine-tuned MarkupLM performed better than a fine-tuned BERT on phish/malicious website classification. The final goal of my project is to create a browser extension with the added benefit that the user's data stays local to their machine like you pointed out.

oh wow nice!

I think the html to selenium code generation is a good one!

yep, I wonder if that's all that's needed for an agent

Another idea I was thinking about was automatic page re-orientation, like how browsers offer "reader mode" for some websites. The model/app could optimize user experience on "clunky" pages (move text around, resize images, summarize or hide irrelevant sections/nodes/paragraphs). Even tutorial use-cases similar to what "WalkMe" does could be supported.

oh wow, reader mode would be a great feature, that's a good idea

huggingface / transformers.js