philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0

.filter does not output an array or behave as .map #49

Closed dynabler closed 6 months ago

dynabler commented 6 months ago

.map and .filter don't behave the same. I looked at js.go and the only difference between the 2 is "Each" for .filter and Map for .map. I was expecting .filter to behave the same as .map with the added benefit of filtering elements.

From README:
items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"]
This works.

From README:
items.get(1).siblings() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

What I actually get:
items.get(1).siblings() gives "WARNING": "Forgot to call text(), html() or attr()?", "length": N
items.get(1).siblings().text() gives all siblings as text on one line, with no spaces and no HTML tags
items.get(1).siblings().html() gives only the first element, with HTML tags

From README:
items.filter(item => item.hasClass("a")) // [<li class="a">Item 1</li>]

What I actually get:
items.filter(item => item.hasClass("a")) gives "WARNING": "Forgot to call text(), html() or attr()?", "length": N (it does know the length)

It's unclear to me where .text() should be added. No matter where I put it, the error is the same panic: TypeError: Object has no member 'text' at stdin_default

philippta commented 6 months ago

The .map function transforms a list of things into a list of other things, e.g. a list of HTML nodes into a list of texts.

The .filter function narrows down a list of things, which means that a list of HTML nodes stays a list of HTML nodes. If you have a list of HTML <a ...> tags and want to filter them and extract the href, you have to filter and then map. This is very common in JavaScript.

Example:

const links = doc.find("a");
// List of HTML nodes:
// - <a href="page1">Page 1</a>
// - <a href="page2" class="active">Page 2</a>
// - <a href="page3">Page 3</a>

const inactiveLinks = links.filter(link => !link.hasClass("active"));
// Filtered list of HTML nodes:
// - <a href="page1">Page 1</a>
// - <a href="page3">Page 3</a>

const urlsOfInactiveLinks = inactiveLinks.map(link => link.attr("href"))
// List of urls:
// - page1
// - page3

Or as a one-liner:

const urlsOfInactiveLinks = doc.find("a").filter(a => !a.hasClass("active")).map(a => a.attr("href"))

The way .siblings() works is similar to a filter. It will give you a list of HTML nodes, which you have to transform using .map() again.
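Applied to the README example from the original question, the same filter-then-map pattern can be sketched in plain JavaScript. This is only an illustration: makeNode is a hypothetical stand-in for a scraped node, and a plain array stands in for the list that flyscrape returns, so none of it is flyscrape's actual implementation.

```javascript
// Hypothetical stand-in for a scraped HTML node (NOT flyscrape's implementation).
// It only mimics the two methods used in this thread: text() and hasClass().
function makeNode(text, classes = []) {
  return {
    text: () => text,
    hasClass: (name) => classes.includes(name),
  };
}

// Plain array standing in for the three <li> items from the README example.
const items = [
  makeNode("Item 1", ["a"]),
  makeNode("Item 2"),
  makeNode("Item 3"),
];

// filter keeps nodes as nodes: the result is still a list of node objects.
const withClassA = items.filter((item) => item.hasClass("a"));

// map is the step that turns each remaining node into a printable value.
const texts = withClassA.map((item) => item.text());

console.log(texts); // texts is ["Item 1"]
```

In a flyscrape script the analogous chains would presumably be items.filter(item => item.hasClass("a")).map(item => item.text()) and items.get(1).siblings().map(item => item.text()), with text() called on each node inside the map callback rather than on the list as a whole.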

dynabler commented 6 months ago

AND being the most important word here. You need both .filter AND .map. So, to recap:

Thanks for clarifying! Much appreciated.

philippta commented 6 months ago

> AND being the most important word here. You need both .filter AND .map. So, to recap:

Exactly!

> Thanks for clarifying! Much appreciated.

Any time.