Use `extract_first` methods and set `''` as default value

matiskay commented 9 years ago

The current pattern in the code for extracting the first element is the following:

        try:
            total_string = response.css('#LblTotal').xpath('./text()').extract()[0]
        except:
            total_string = ''

This pattern is pretty ugly. A better way to do it is to use the `extract_first(default='')` method:

        >>> sel.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
        'not-found'

Documentation: http://doc.scrapy.org/en/latest/topics/selectors.html#id1
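To make the difference concrete without needing a live response, here is a minimal sketch in plain Python. `first_or_default` is a hypothetical stand-in that only mimics what `extract_first(default=...)` does on the list returned by `.extract()`; the sample value is made up for illustration.

```python
def first_or_default(extracted, default=None):
    """Hypothetical stand-in for extract_first(): first item, else default."""
    for value in extracted:
        return value
    return default


# What .extract() gives back when nothing matches and when one node matches.
# The sample text is invented for this sketch.
no_match = []
one_match = ['  S/. 120.00 ']

# Old pattern: indexing an empty list raises IndexError, hence the try/except.
try:
    total_string = no_match[0]
except IndexError:
    total_string = ''

# New pattern: a single expression that yields the same result.
assert first_or_default(no_match, default='') == total_string

# With default='' a chained .strip() stays safe when nothing matched;
# with the default of None it would raise AttributeError instead.
clean = first_or_default(one_match, default='').strip()
empty = first_or_default(no_match, default='').strip()
```

This is why `''` rather than `None` is suggested as the default: most of the spider lines chain `.strip()` or pass the value to `re.sub`, and both expect a string.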

matiskay commented 9 years ago

Here are the lines of code that use the `extract()[0]` pattern:

./inpe.py:70:                item['full_name'] = fields[0].xpath("text()").extract()[0].strip()
./inpe.py:73:                    item['id_document'] = fields[1].xpath("text()").extract()[0].strip()
./inpe.py:77:                item['id_number'] = fields[2].xpath("text()").extract()[0].strip()
./inpe.py:78:                item['entity'] = fields[3].xpath("text()").extract()[0].strip()
./inpe.py:79:                item['reason'] = fields[4].xpath("text()").extract()[0].strip()
./inpe.py:80:                item['host_name'] = fields[5].xpath("text()").extract()[0].strip()
./inpe.py:81:                item['title'] = fields[6].xpath("text()").extract()[0].strip()
./inpe.py:82:                item['office'] = fields[7].xpath("text()").extract()[0].strip()
./inpe.py:86:                item['time_start'] = times[1].xpath("text()").extract()[0].strip()
./minem.py:96:            item['entity'] = re.sub("\s+", " ", fields[3].xpath("text()").extract()[0].strip())
./minem.py:97:            item['host_name'] = re.sub("\s+", " ", fields[5].xpath("text()").extract()[0].strip())
./minem.py:98:            item['reason'] = re.sub("\s+", " ", fields[4].xpath("text()").extract()[0].strip())
./minem.py:99:            item['title'] = re.sub("\s+", " ", fields[6].xpath("text()").extract()[0].strip())
./minem.py:100:            item['office'] = re.sub("\s+", " ", fields[7].xpath("text()").extract()[0].strip())
./minem.py:101:            item['time_start'] = re.sub("\s+", " ", fields[8].xpath("text()").extract()[0].strip())
./minem.py:104:                document_identity = fields[2].xpath("text()").extract()[0].strip()
./minem.py:112:                item['time_end'] = re.sub("\s+", " ", fields[9].xpath("text()").extract()[0].strip())
./mtc.py:17:        event_validation = response.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract()[0]
./produce.py:61:                    item['time_start'] = this_record[2].xpath('text()').extract()[0]
./produce.py:66:                    item['full_name'] = this_record[3].xpath('text()').extract()[0]
./produce.py:71:                    item['id_document'] = this_record[4].xpath('text()').extract()[0]
./produce.py:76:                    item['id_number'] = this_record[5].xpath('text()').extract()[0]
./produce.py:81:                    item['reason'] = this_record[6].xpath('text()').extract()[0]
./produce.py:86:                    item['host_name'] = this_record[7].xpath('text()').extract()[0]
./produce.py:91:                    item['office'] = this_record[8].xpath('text()').extract()[0]
./produce.py:96:                    item['time_end'] = this_record[9].xpath('text()').extract()[0]
./tc.py:58:                        item['full_name'] = sel.xpath('td')[2].xpath('text()').extract()[0]
./tc.py:63:                        item['id_document'] = sel.xpath('td')[3].xpath('text()').extract()[0]
./tc.py:68:                        item['id_number'] = sel.xpath('td')[4].xpath('text()').extract()[0]
./tc.py:73:                        item['reason'] = sel.xpath('td')[5].xpath('text()').extract()[0]
./tc.py:78:                        item['host_name'] = sel.xpath('td')[6].xpath('text()').extract()[0]
./tc.py:83:                        item['time_start'] = sel.xpath('td')[1].xpath('text()').extract()[0]
./tc.py:88:                        item['time_end'] = sel.xpath('td')[8].xpath('text()').extract()[0]
./tc.py:100:                        item['full_name'] = sel.xpath('td')[2].xpath('text()').extract()[0]
./tc.py:105:                        item['id_document'] = sel.xpath('td')[3].xpath('text()').extract()[0]
./tc.py:110:                        item['id_number'] = sel.xpath('td')[4].xpath('text()').extract()[0]
./tc.py:115:                        item['reason'] = sel.xpath('td')[5].xpath('text()').extract()[0]
./tc.py:120:                        item['host_name'] = sel.xpath('td')[6].xpath('text()').extract()[0]
./tc.py:125:                        item['time_start'] = sel.xpath('td')[1].xpath('text()').extract()[0]
./tc.py:130:                        item['time_end'] = sel.xpath('td')[7].xpath('text()').extract()[0]
./tc.py:142:                        item['full_name'] = sel.xpath('td')[1].xpath('text()').extract()[0]
./tc.py:147:                        item['id_document'], item['id_number'] = utils.get_dni(sel.xpath('td')[2].xpath('text()').extract()[0])
./tc.py:153:                        item['entity'] = sel.xpath('td')[3].xpath('text()').extract()[0]
./tc.py:158:                        item['reason'] = sel.xpath('td')[4].xpath('text()').extract()[0]
./tc.py:163:                        item['host_name'] = sel.xpath('td')[5].xpath('text()').extract()[0]
./tc.py:168:                        item['office'] = sel.xpath('td')[6].xpath('text()').extract()[0]
./tc.py:173:                        item['time_start'] = sel.xpath('td')[7].xpath('text()').extract()[0]
./tc.py:178:                        item['time_end'] = sel.xpath('td')[8].xpath('text()').extract()[0]

and the spiders that use this pattern are:

./inpe.py
./minem.py
./mtc.py
./produce.py
./tc.py
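For a mechanical first pass over these files, something like the sketch below could work. `use_extract_first` is a hypothetical helper, not part of the repo or of Scrapy; it does a plain text substitution, so it changes behaviour wherever the surrounding code relies on the `IndexError`, and each rewritten line would still need a review.

```python
import re

# Matches the literal call chain `.extract()[0]`.
PATTERN = re.compile(r"\.extract\(\)\[0\]")


def use_extract_first(source, default="''"):
    """Return source with .extract()[0] replaced by .extract_first(default=...)."""
    # A function is used as the replacement so that `default` is inserted
    # verbatim, without re.sub interpreting backslashes in it.
    return PATTERN.sub(lambda m: ".extract_first(default=%s)" % default, source)


# One of the lines from inpe.py listed above.
before = """item['full_name'] = fields[0].xpath("text()").extract()[0].strip()"""
after = use_extract_first(before)
print(after)
```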

Note

List the script names and the line numbers where `extract()[0]` appears: grep -nR 'extract()\[0\]' .
List the names of the scripts that use the pattern: grep -nR 'extract()\[0\]' . | cut -d ':' -f 1 | uniq
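If grep is not available, the same two listings can be approximated in Python. This is only a sketch: `find_pattern` and `files_with_pattern` are made-up names, and unlike `grep -R` it looks only at `*.py` files.

```python
from pathlib import Path

NEEDLE = "extract()[0]"


def find_pattern(root):
    """Yield (path, line_number, line) for every line containing NEEDLE."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if NEEDLE in line:
                yield path, number, line.strip()


def files_with_pattern(root):
    """Return the distinct files that contain NEEDLE, in sorted order."""
    return sorted({path for path, _, _ in find_pattern(root)})
```

Iterating over `find_pattern('.')` from the spiders directory corresponds to the first command, and `files_with_pattern('.')` to the second.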

aniversarioperu commented 9 years ago

Whoa, “There is, brothers, so very much to do”

matiskay commented 9 years ago

The only spider remaining is TcSpider. I think we can close this issue, because the visit pages for "Tribunal Constitucional" are not consistent.

manolo-rocks / manolo_scraper

Use `extract_first` methods and set `''` as default value #28
