How to extract the news detail page?

annian101 commented 8 months ago

How can I extract a news detail page?

platonai commented 7 months ago

    val url = "https://www.eeo.com.cn/2024/0330/648712.shtml"
    val session = ScentContexts.createSession()
    // Load the page and extract the main article from it
    val document = session.harvestArticle(url, session.options())

    // Print the extracted title and body text
    println(document.contentTitle)
    println(document.textContent)

eeo.com.cn crawler

platonai commented 7 months ago

If you need an open source solution, use the code below:

    fun harvestArticle(page: WebPage): TextDocument {
        // Parse the page content into a TextDocument, then run ChineseNewsExtractor over it
        return SAXInput()
            .parse(page.baseUrl, page.contentAsSaxInputSource)
            .also { ChineseNewsExtractor().process(it) }
    }

ChineseNewsExtractor is implemented in PulsarRPA.

annian101 commented 7 months ago

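Is the harvestArticle snippet above generic across news websites? I see that your code directory has separate entries for Baidu news, eeo news, and so on. If I apply it to sites other than these to fetch detail pages, will it still work?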

annian101 commented 7 months ago

Also, can Exotic extract detail pages?

ZhujingJava commented 7 months ago

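Different websites have different element structures, so each company's site needs its own extraction logic, for example amazon, zhihu, jd, and so on.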
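One way to organize that kind of per-site logic is a registry keyed by host name, with a generic fallback for sites that have no dedicated rule. The sketch below is only an illustration under stated assumptions: `Article`, `Extractor`, `extractorsByHost`, the host `news.example.com`, and the `<div class="article">` layout are all hypothetical and are not part of PulsarRPA or PulsarRPAPro. The regex-based parsing only keeps the example dependency-free; a real extractor would use a proper HTML parser.

```kotlin
import java.net.URI

// Hypothetical result type for this sketch; not PulsarRPA's TextDocument.
data class Article(val title: String, val text: String)

// A site extractor takes raw HTML and returns an Article, or null if the page does not match.
typealias Extractor = (html: String) -> Article?

private val htmlFlags = setOf(RegexOption.IGNORE_CASE, RegexOption.DOT_MATCHES_ALL)

// Returns the first capture group of the first match, trimmed, or null if nothing matches.
fun firstMatch(html: String, pattern: String): String? =
    Regex(pattern, htmlFlags).find(html)?.groupValues?.get(1)?.trim()

// Replaces tags with spaces and collapses runs of whitespace.
fun stripTags(fragment: String): String =
    fragment.replace(Regex("<[^>]+>"), " ").replace(Regex("\\s+"), " ").trim()

// Generic fallback: the <title> element plus the text of every <p> element.
val genericExtractor: Extractor = { html ->
    val title = firstMatch(html, "<title[^>]*>(.*?)</title>")
    val text = Regex("<p[^>]*>(.*?)</p>", htmlFlags)
        .findAll(html)
        .joinToString("\n") { stripTags(it.groupValues[1]) }
    if (title == null || text.isBlank()) null else Article(stripTags(title), text)
}

// Hypothetical site rule: assumes the headline is in <h1> and the body in <div class="article">.
val exampleSiteExtractor: Extractor = { html ->
    val title = firstMatch(html, "<h1[^>]*>(.*?)</h1>")
    val body = firstMatch(html, "<div class=\"article\"[^>]*>(.*?)</div>")
    if (title == null || body == null) null else Article(stripTags(title), stripTags(body))
}

// Registry of site-specific rules, keyed by host name.
val extractorsByHost: Map<String, Extractor> = mapOf(
    "news.example.com" to exampleSiteExtractor
)

// Uses the site-specific rule when one is registered and succeeds, else the generic fallback.
fun extract(url: String, html: String): Article? {
    val host = URI(url).host ?: return null
    return extractorsByHost[host]?.invoke(html) ?: genericExtractor(html)
}

fun main() {
    val html = """<html><head><title>Page</title></head><body><h1>Headline</h1>""" +
        """<div class="article"><p>First.</p><p>Second.</p></div></body></html>"""
    println(extract("https://news.example.com/a/1", html)) // site-specific rule
    println(extract("https://other.example.org/a/1", html)) // generic fallback
}
```

With this layout, supporting a new site means adding one entry to the map, while unknown sites still get a best-effort result from the fallback.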

platonai commented 7 months ago

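Regarding Exotic: the README on the project home page covers this.

More information:

https://www.bilibili.com/video/BV1qV411R7Xq/ This video shows how our AI technology accurately understands every field on a web page and turns the page into structured data or an Excel spreadsheet. By combining unsupervised and supervised learning for web data extraction, we have improved labor efficiency by more than 1000x, improved extraction accuracy, lowered the skill requirements for staff, and removed the need to frequently maintain extraction rules.

http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv This is a live demo of the AI accurately understanding and extracting web page fields.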
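The `url` query parameter in the demo link looks like a Base64-encoded target URL. The sketch below decodes it to show which page the demo points at. It assumes standard (not URL-safe) Base64 without percent-encoding, which holds for this particular link; `decodeTargetUrl` is a hypothetical helper name, and how the demo server actually interprets the parameter is not confirmed anywhere in this thread.

```kotlin
import java.net.URI
import java.util.Base64

// Returns the Base64-decoded value of the `url` query parameter, or null if it is absent.
fun decodeTargetUrl(demoLink: String): String? {
    val query = URI(demoLink).rawQuery ?: return null
    val encoded = query.split("&")
        .map { it.split("=", limit = 2) }
        .firstOrNull { it.size == 2 && it[0] == "url" }
        ?.get(1) ?: return null
    return String(Base64.getDecoder().decode(encoded), Charsets.UTF_8)
}

fun main() {
    println(decodeTargetUrl("http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv"))
    // prints https://www.hua.com/meigui/
}
```

Going the other way, `Base64.getEncoder().encodeToString(url.toByteArray())` would produce a parameter value for a different page, again assuming the demo accepts arbitrary targets.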

https://www.bilibili.com/video/BV1Zi4y1h7aq/

xieliaing commented 3 months ago

Hello, I emailed your company but did not get a reply. How can I get in touch?

galaxyeye commented 3 months ago

Thank you for your interest.

You can add me on WeChat directly: galaxyeye. Thank you very much.

WeChat: galaxyeye Weibo: galaxyeye Twitter: galaxyeye8 Website: platon.ai

xieliaing commented 3 months ago

OK, I will add you on WeChat. My colleague will also contact you through our company email. Thank you, Liang

platonai / PulsarRPAPro

How to extract the news detail page? #19