jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0
7.05k stars 555 forks

can not parse image from <picture> tag #149

Closed DophinL closed 2 weeks ago

DophinL commented 3 weeks ago
fetch('https://r.jina.ai/https://www.cnbc.com/2024/10/29/sam-altmans-reddit-stake-now-worth-over-1-billion-after-earnings-pop.html', {
  method: 'GET',
  headers: {
    "Accept": "application/json",
    "X-With-Images-Summary": "true",
    "X-Target-Selector": ".ArticleBody-articleBody"
  },
})

You can see from the execution result that the main image has not been retrieved.

Please support the <picture> tag. I'm a paid user, thanks 🫡

nomagick commented 3 weeks ago

Thanks for reporting.

I have just updated the service to expose the image in <picture>.

DophinL commented 3 weeks ago

Amazing speed, thank you so much!

DophinL commented 1 week ago

@nomagick It doesn't seem to work on this site; the main picture is not returned:

fetch('https://r.jina.ai/https://www.foxsports.com/stories/mlb/chris-sale-aaron-judge-shohei-ohtani-highlight-all-mlb-2024-awardees', {
  method: 'GET',
  headers: {
    "Accept": "application/json",
    "X-With-Images-Summary": "true",
    "X-Target-Selector": ".story-content-main"
  },
}).then(res => res.json()).then(res => console.log('jina resp', res))

DophinL commented 1 week ago

This website has the same problem:

fetch('https://r.jina.ai/https://cryptodaily.co.uk/2024/11/crypto-price-analysis-11-14-bitcoin-btc-ethereum-eth-solana-sol-ripple-xrp-dogecoin-doge-uniswap-uni-immutable-imx', {
  method: 'GET',
  headers: {
    "Accept": "application/json",
    "X-With-Images-Summary": "true",
    "X-Target-Selector": ".news-details-layout1"
  },
}).then(res => res.json()).then(res => console.log('jina resp', res))

nomagick commented 1 week ago

This was because the image only has a srcset and no src.

I've now modified Reader to consider srcset as well.
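For readers wondering what "considering srcset" could look like, here is a minimal sketch. It is not Reader's actual code: `parseSrcset` and `pickImageUrl` are hypothetical helpers, and the comma split is a simplification that does not handle URLs which themselves contain commas.

```javascript
// Hypothetical helper: split a srcset string into { url, descriptor } candidates.
// Simplification: splits on every comma, so URLs containing commas are not supported.
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => {
      // The first token is the URL; the optional second token is a width/density descriptor.
      const [url, descriptor = '1x'] = part.split(/\s+/);
      return { url, descriptor };
    });
}

// Hypothetical helper: prefer src, and fall back to the first srcset candidate.
function pickImageUrl({ src, srcset }) {
  if (src) return src;
  if (!srcset) return null;
  const candidates = parseSrcset(srcset);
  return candidates.length > 0 ? candidates[0].url : null;
}
```

Calling `pickImageUrl` with only a `srcset` value returns the first listed URL, while an element that has a `src` keeps using it. Choosing the first candidate is an arbitrary choice for this sketch; a real implementation might prefer the largest width instead.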

DophinL commented 6 days ago

Thanks for the quick response! It seems it hasn't taken effect yet. Has it not been released?

nomagick commented 6 days ago

It was deployed days ago. The image did not show up for this particular page because the Readability library we use intentionally removed the block while cleaning up the page. Try the markdown return format to tell Reader not to use Readability:

curl https://r.jina.ai/https://www.foxsports.com/stories/mlb/chris-sale-aaron-judge-shohei-ohtani-highlight-all-mlb-2024-awardees \
  -H 'x-return-format: markdown' \
  -H 'x-target-selector: .story-content-main'
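
For anyone calling Reader from JavaScript like the earlier snippets in this thread, the curl command above might translate roughly as follows. This is a sketch: `buildReaderRequest` and `fetchMarkdown` are hypothetical helper names, and only the two headers shown in the curl command are used. The response is read as text because the markdown format is requested instead of JSON.

```javascript
// Hypothetical helper: build the URL and fetch options for a markdown-format Reader request.
function buildReaderRequest(targetUrl, selector) {
  const headers = { 'x-return-format': 'markdown' };
  // The target selector is optional; only send it when one is given.
  if (selector) headers['x-target-selector'] = selector;
  return {
    url: `https://r.jina.ai/${targetUrl}`,
    options: { method: 'GET', headers },
  };
}

// Hypothetical helper: perform the request and return the markdown body as a string.
async function fetchMarkdown(targetUrl, selector) {
  const { url, options } = buildReaderRequest(targetUrl, selector);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Reader request failed: ${res.status}`);
  return res.text();
}
```

Usage would look like `fetchMarkdown('https://www.foxsports.com/stories/mlb/chris-sale-aaron-judge-shohei-ohtani-highlight-all-mlb-2024-awardees', '.story-content-main').then(md => console.log(md))`.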