scrapinghub / portia2code

BSD 3-Clause "New" or "Revised" License

Have issue with "#portia-content": "#dummy" #8

Open xieyanfu opened 6 years ago

xieyanfu commented 6 years ago

A Portia project file can contain a "#portia-content": "#dummy" annotation, as in the example below:

    {
        "annotations": {
            "#portia-content": "#dummy"
        },
        "container_id": null,
        "id": "c845-4bb7-8a7b",
        "item_container": true,
        "repeated": false,
        "required": [],
        "schema_id": "b8f6-4365-8081",
        "selector": null,
        "siblings": 0,
        "tagid": null,
        "text-content": "#portia-content"
    },
    {
        "accept_selectors": [
            "[itemtype=\"http://schema.org/Product\"] [itemprop=\"name\"]"
        ],
        "container_id": "c845-4bb7-8a7b",
        "data": {
            "3835-4eef-8b4c": {
                "attribute": "content",
                "extractors": {},
                "field": "b25b-43ce-904b",
                "required": false
            }
        },
        "id": "a338-4038-ace9",
        "text-content": "content",
        "post_text": null,
        "pre_text": null,
        "reject_selectors": [],
        "required": [],
        "repeated": false,
        "selection_mode": "css",
        "selector": "[itemtype=\"http://schema.org/Product\"] [itemprop=\"name\"]",
        "tagid": null,
        "xpath": "//"
    }

Portia generates the "#portia-content": "#dummy" annotation itself and sets it as the container for the other selectors. However, when such a project is ported with portia2code, the generated Scrapy spider class ends up with an empty item list, items = [[]].
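To check whether a project is affected, something like the following could be used. It is only a minimal sketch: it assumes the annotations have already been loaded as a list of dicts with the fields shown above, and find_selectorless_containers is a hypothetical helper, not part of portia2code.

```python
def find_selectorless_containers(annotations):
    """Return the ids of item containers whose selector is missing or empty."""
    return [
        annotation.get("id")
        for annotation in annotations
        if annotation.get("item_container") and not annotation.get("selector")
    ]


# Trimmed-down copies of the two annotations from the project file above.
annotations = [
    {"id": "c845-4bb7-8a7b", "item_container": True, "selector": None},
    {
        "id": "a338-4038-ace9",
        "container_id": "c845-4bb7-8a7b",
        "selector": '[itemtype="http://schema.org/Product"] [itemprop="name"]',
    },
]

print(find_selectorless_containers(annotations))
```

Any id reported here is a container whose selector is null, which is the case that leads to the empty item list.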

The empty item list is caused by container_to_item in utils.py, which returns early when the container has no selector:

if not selector: return None

It should fall back to an empty selector instead:

if not selector: selector = ''

xieyanfu commented 6 years ago

Maybe it should be changed to the following instead, so that CSS selectors fall back to 'html':

if not selector: selector = 'html' if selector_type == 'css' else ''
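The proposed fallback can be tried in isolation. The sketch below is an illustration only: fallback_selector is a hypothetical standalone function, not the actual container_to_item code, and it assumes selector_type is a string such as 'css' or 'xpath'.

```python
def fallback_selector(selector, selector_type):
    """Return the selector unchanged, or a default when it is missing.

    Mirrors the one-line fix proposed above: a CSS container with no
    selector falls back to 'html'; any other type falls back to ''.
    """
    if not selector:
        selector = 'html' if selector_type == 'css' else ''
    return selector


# The "#dummy" container from the project file has "selector": null.
print(fallback_selector(None, 'css'))
print(fallback_selector(None, 'xpath'))
print(fallback_selector('[itemprop="name"]', 'css'))
```

With this fallback the container is no longer skipped. Whether an empty string is a usable default for non-CSS selectors has not been verified here.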