Hello, could AyugeSpiderTools add support for Elasticsearch?

joykerl commented 10 months ago

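Is your feature request related to a problem? Please describe. We were using a relational database, and once the crawled data reached a certain volume, indexed queries kept getting slower and became very hard to optimize.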

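Describe the solution you'd like I hope the author will consider adding support for storing items in Elasticsearch, wrapping the usual es configuration and the item so that projects no longer have to repeat the same setup over and over.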

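Describe alternatives you've considered Storing Scrapy output in Elasticsearch is a fairly common scenario. Daily crawling keeps growing the data and degrades index performance, which calls for a dedicated search and indexing engine such as es.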

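Additional context Thank you for your open-source library; it has greatly improved our productivity when writing crawlers. I hope AyugeSpiderTools keeps getting stronger.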

shengchenyang commented 10 months ago

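After some thought, supporting es in AyugeSpiderTools currently raises the following questions (uncertainties rather than real difficulties):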

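1. Version dependencies of elasticsearch-dsl-py: different es versions require different elasticsearch-dsl-py releases, and there is no way to know which one a user needs before they install this library. This is a minor issue; a log message can point users to the right version.
2. The API differs between elasticsearch-dsl-py versions, so it is hard to adapt AyuItem DataItem in a way that elegantly matches the es DocType/Document declaration.
3. Failing that, the only option left is to have the user pass in an es DocType/Document, which amounts to doing half the job, and I don't think that is any more convenient than writing it by hand.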
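
The log hint suggested for the version-dependency problem above could look roughly like the following. This is only a minimal sketch, not the library's implementation: the function names, the logger name, and the rule of comparing major versions are all assumptions made for illustration, and only `importlib.metadata` from the standard library is used.

```python
import logging
from importlib import metadata
from typing import Optional

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("es_version_check")


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '8.6.0'."""
    return int(version.split(".")[0])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def warn_on_version_mismatch(
    server_version: str, package: str = "elasticsearch-dsl"
) -> bool:
    """Log a hint when the client package is missing or its major version
    differs from the server's. Returns True when the versions look compatible.

    Assumption: matching major versions is treated as "compatible".
    """
    client_version = installed_version(package)
    if client_version is None:
        logger.warning(
            "%s is not installed; install a release matching es %s",
            package,
            server_version,
        )
        return False
    if major_version(client_version) != major_version(server_version):
        logger.warning(
            "%s %s may not match es %s; consider installing a matching release",
            package,
            client_version,
            server_version,
        )
        return False
    return True
```

A pipeline could call `warn_on_version_mismatch("8.6.0")` once at start-up, so the user sees the hint before any document is written.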

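Note: for this example, see demo_es in DemoSpider, and check the connection settings in pipelines.py. Remember to install the elasticsearch-dsl-py dependency. The example targets es v8.6.0; please take a look first and see whether this kind of implementation is acceptable.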

from elasticsearch_dsl import connections

connections.create_connection(hosts="http://localhost:9200")

# Pseudocode for approach 2:
from ayugespidertools.items import AyuItem, DataItem
from elasticsearch_dsl import Document, Keyword, Text

book_info_item = AyuItem(
    book_name=DataItem(
        book_name, Text(analyzer="snowball", fields={"raw": Keyword()})
    ),
    book_href=DataItem(book_href, Keyword()),
    book_intro=DataItem(book_intro, Keyword()),
    _table=DataItem("demo_es", "es_index_name"),
)

# to achieve something like the following
class ArticleType(Document):
    book_name = Text(analyzer="snowball", fields={"raw": Keyword()})
    book_href = Keyword()
    book_intro = Keyword()

    class Index:
        name = "demo_es"
        settings = {
            "number_of_shards": 2,
        }
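
The idea in the pseudocode above is that each field carries both its value and its es field definition, so a single declaration can yield the document to store and the index mapping. A minimal sketch of that idea in plain Python follows. It does not use the real AyuItem or DataItem API: `FieldNote` and `split_item` are hypothetical stand-ins, the sample values are invented, and the field definitions are written as raw Elasticsearch mapping JSON (`text`, `keyword`, `analyzer`, `fields`) rather than elasticsearch-dsl-py objects.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class FieldNote:
    """Hypothetical stand-in for DataItem: a value plus an es field definition."""

    value: Any
    es_field: Dict[str, Any]


def split_item(
    fields: Dict[str, FieldNote],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split annotated fields into (document body, index-creation body)."""
    doc = {name: note.value for name, note in fields.items()}
    properties = {name: note.es_field for name, note in fields.items()}
    return doc, {"mappings": {"properties": properties}}


# Invented sample values, mirroring the three fields in the pseudocode.
item = {
    "book_name": FieldNote(
        "Example title",
        {
            "type": "text",
            "analyzer": "snowball",
            "fields": {"raw": {"type": "keyword"}},
        },
    ),
    "book_href": FieldNote("https://example.com/book/1", {"type": "keyword"}),
    "book_intro": FieldNote("A short introduction.", {"type": "keyword"}),
}

doc, index_body = split_item(item)
```

Here `doc` holds only the values to index, while `index_body` has the shape of a request body for creating an index with explicit mappings. Index-level options such as `number_of_shards` would go under a separate `settings` key.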

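As the demo_es example in DemoSpider shows, the core code in the corresponding pipelines is only 4 lines, so I would rather recommend that developers add the code for the es version they need and build it themselves, or start by implementing the feature along the lines of demo_es in DemoSpider.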

I'll keep this feature request open for now while I think further about how to adapt it.

shengchenyang commented 10 months ago

On further reflection, approach 2 is still the way to go; approach 3 is too ugly to accept.

So es support will be developed using approach 2, but I will only support the most recent es version, to avoid maintenance hell.

Users with Python experience are encouraged to follow the same pattern and build their own customized version of the library.

joykerl commented 10 months ago

Thank you very much. I'll test it over the next couple of days.

shengchenyang commented 10 months ago

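The commit "feat: add es support" has basically completed es support, and I've sent you a pre-release package; feel free to report any problems. I'll publish a release once it has been polished.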

shengchenyang commented 10 months ago

Version 3.9.4, which supports elasticsearch, has been out for a while, so I'm closing this issue. If you run into problems while using it, please open a new issue.

shengchenyang / AyugeSpiderTools

Hello, could AyugeSpiderTools add support for Elasticsearch? #15