philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
Mozilla Public License 2.0
1.02k stars, 29 forks

.filter does not output an array or behave as .map #49

Closed dynabler closed 6 months ago

dynabler commented 6 months ago

.map and .filter don't behave the same. I looked at js.go and the only difference between the two is that .filter uses "Each" and .map uses "Map". I was expecting .filter to behave the same as .map, with the added benefit of filtering elements.

From README: items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"] This works.

From README: items.get(1).siblings() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

items.get(1).siblings() gives "WARNING": "Forgot to call text(), html() or attr()?", "length": N
items.get(1).siblings().text() gives the text of all siblings on one line, with no spaces and no HTML tags
items.get(1).siblings().html() gives only the first element, with HTML tags

From README: items.filter(item => item.hasClass("a")) // [<li class="a">Item 1</li>]

items.filter(item => item.hasClass("a")) gives "WARNING": "Forgot to call text(), html() or attr()?", "length": N (it does know the length)

It's unclear to me where .text() should be added. No matter where I put it, the error is the same panic: TypeError: Object has no member 'text' at stdin_default

philippta commented 6 months ago

The .map function transforms a list of things into a list of other things, e.g. a list of HTML nodes into a list of texts.

The .filter function filters down a list of things, which means that a list of HTML nodes stays a list of HTML nodes. If you have a list of HTML <a ...> tags and want to filter them and extract the href, you have to filter and map. This is very common in JavaScript.


const links = doc.find("a");
// List of HTML nodes:
// - <a href="page1">Page 1</a>
// - <a href="page2" class="active">Page 2</a>
// - <a href="page3">Page 3</a>

const inactiveLinks = links.filter(link => !link.hasClass("active"));
// Filtered list of HTML nodes:
// - <a href="page1">Page 1</a>
// - <a href="page3">Page 3</a>

const urlsOfInactiveLinks = inactiveLinks.map(link => link.attr("href"));
// List of urls:
// - page1
// - page3

Or as a one-liner:

const urlsOfInactiveLinks = doc.find("a").filter(a => !a.hasClass("active")).map(a => a.attr("href"))

The way .siblings() works is similar to a filter. It will give you a list of HTML nodes, which you have to transform using .map() again.
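
To make the "filter keeps nodes, map extracts values" point concrete, here is a minimal sketch in plain JavaScript. It uses ordinary arrays as stand-ins for flyscrape's node lists, and makeNode is a hypothetical helper invented for this illustration, not part of flyscrape. It only mimics the text(), attr() and hasClass() methods mentioned in this thread, so treat it as an illustration of the pattern rather than of flyscrape's internals.

```javascript
// Hypothetical stand-in for an HTML node (NOT flyscrape's implementation).
// It only mimics the three methods discussed in this thread.
function makeNode(text, attrs = {}) {
  const classes = (attrs.class || "").split(" ").filter(Boolean);
  return {
    text: () => text,
    attr: (name) => attrs[name],
    hasClass: (name) => classes.includes(name),
  };
}

// Plain array standing in for the result of doc.find("a").
const links = [
  makeNode("Page 1", { href: "page1" }),
  makeNode("Page 2", { href: "page2", class: "active" }),
  makeNode("Page 3", { href: "page3" }),
];

// filter: a list of nodes stays a list of nodes (just fewer of them).
const inactiveLinks = links.filter((link) => !link.hasClass("active"));

// map: each node is turned into a plain value.
// Note that text() / attr() are called on each node inside the callback,
// not on the list itself.
const urls = inactiveLinks.map((link) => link.attr("href"));
const texts = inactiveLinks.map((link) => link.text());

console.log(urls);
console.log(texts);
```

The same shape should apply to anything that returns a list of nodes, such as the result of .siblings(): call .map() on the list and put .text(), .html() or .attr() inside the callback.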

dynabler commented 6 months ago

AND being the most important word here. You need both .filter AND .map. So, to recap:

Thanks for clarifying! Much appreciated.

philippta commented 6 months ago

Any time.