mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[Bug] Extraneous back slashes #662

Closed iuliaturc closed 1 day ago

iuliaturc commented 5 days ago

When scraping https://huggingface.co/docs/transformers/main_classes/pipelines, I'm seeing a lot of backslashes:


Firecrawl Markdown:

### FillMaskPipeline\
\
### classtransformers.FillMaskPipeline\
\
[<source>](https://github.com/huggingface/transformers/blob/v4.21.2/src/transformers/pipelines/fill_mask.py#L34)\
\
(model: typing.Union\[ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')\]tokenizer: typing.Optional\[transformers.tokenization\_utils.PreTrainedTokenizer\] = Nonefeature\_extractor: typing.Optional\[ForwardRef('SequenceFeatureExtractor')\] = Nonemodelcard: typing.Optional\[transformers.modelcard.ModelCard\] = Noneframework: typing.Optional\[str\] = Nonetask: str = ''args\_parser: ArgumentHandler = Nonedevice: int = -1binary\_output: bool = False\*\*kwargs)\
\
Parameters\
\
- **model** ( [PreTrainedModel](/docs/transformers/v4.21.2/en/main_classes/model#transformers.PreTrainedModel) or [TFPreTrainedModel](/docs/transformers/v4.21.2/en/main_classes/model#transformers.TFPreTrainedModel)) —\
The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\
[PreTrainedModel](/docs/transformers/v4.21.2/en/main_classes/model#transformers.PreTrainedModel) for PyTorch and [TFPreTrainedModel](/docs/transformers/v4.21.2/en/main_classes/model#transformers.TFPreTrainedModel) for TensorFlow.\

Note that these backslashes don't always show up. For instance, when I scrape https://huggingface.co/transformers/main_classes/tokenizer.html#transformers, I get cleaner Markdown:

## PreTrainedModel

### classtransformers.PreTrainedModel

[<source>](https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L1297)

(config: PretrainedConfig\*inputs\*\*kwargs)

Base class for all models.

[PreTrainedModel](/docs/transformers/v4.44.2/en/main_classes/model#transformers.PreTrainedModel) takes care of storing the configuration of the models and handles methods for loading,
downloading and saving models as well as a few methods common to all models to:

nickscamara commented 5 days ago

Interesting, ccing @tomkosm here.

rafaelsideguide commented 1 day ago

Hey @iuliaturc, thanks for bringing this up! The backslashes you’re seeing are actually due to the way our markdown parser handles text that’s part of a link or button. In this case, the text you’re referring to is likely inside an expandable block (with the "expand 14 parameters" button). The parser adds these backslashes to preserve the link functionality within markdown.
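
If the backslashes get in the way downstream, one possible workaround is to post-process the scraped markdown on the client side. The snippet below is only a minimal sketch: `clean_scraped_markdown` is a hypothetical helper, not part of Firecrawl, and it assumes you already have the markdown as a plain string. It removes a lone backslash at the end of a line (CommonMark reads that as a hard line break) and, optionally, unescapes `\[`, `\]`, `\_` and `\*`.

```python
import re

# Hypothetical helper for illustration; this is not a Firecrawl API.

# A single backslash at the end of a line. The lookbehind skips an
# escaped backslash ("\\" in the markdown), which is literal text.
_TRAILING_BACKSLASH = re.compile(r"(?<!\\)\\$", re.MULTILINE)

# Backslash escapes seen in the sample output above: \[ \] \_ \*
_ESCAPED_PUNCT = re.compile(r"\\([\[\]_*])")


def clean_scraped_markdown(markdown: str, unescape: bool = False) -> str:
    """Strip end-of-line backslashes; optionally unescape [, ], _ and *."""
    cleaned = _TRAILING_BACKSLASH.sub("", markdown)
    if unescape:
        cleaned = _ESCAPED_PUNCT.sub(r"\1", cleaned)
    return cleaned


sample = "### FillMaskPipeline\\\n\\\nframework: typing.Optional\\[str\\] = None\\\n"
print(clean_scraped_markdown(sample, unescape=True))
```

Unescaping is off by default because characters such as `_` and `*` become markdown syntax again once the backslash is gone. That is usually fine when the text is fed to an LLM as plain text, but it can change how the markdown renders.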

We’ll be closing this issue as "not planned," but feel free to reopen it or create a new issue if needed. Let me know if you have any further questions!

iuliaturc commented 20 hours ago

Thanks for the explanation!