scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

Scrapy.selector Enhancement Proposal #906

Closed zwidny closed 8 years ago

zwidny commented 9 years ago

Make scrapy.selector into a separate project.

When we're scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this: BeautifulSoup, lxml. And although scrapy selectors are built over the lxml library, it is a better way to extract data from HTML source. In my opinion, people will prefer selector to BeautifulSoup and lxml , if make the scrapy.selectors into a separate project,

kmike commented 9 years ago

+1 from me to a general idea of extracting Selector to a separate library. But I haven't checked the details. See also https://github.com/scrapy/scrapy/issues/395#issuecomment-25055026. Selector can already be used without the rest of scrapy: Selector(text=html), but I like an idea of making the separation cleaner if it is possible. To move forward this needs a more detailed discussion/proposal, +1s from other Scrapy commiters and a person who will actually implement it :)

zwidny commented 9 years ago

All right! Thank you:)

umrashrf commented 9 years ago

+1

I thought of this too a while ago. While using requests for on-demand scraping, I installed Scrapy just to use the chained Selectors.

dangra commented 9 years ago

+1 the hard part is how to untie scrapy.http.Response and the now over-engineered parsing-cache provided by scrapy.selector.lxmldocument.LxmlDocument. I guess accepting Response objects (and its caching) can be implemented inside Scrapy extending from the Selector class provided by the separate library.

Digenis commented 9 years ago

+1 I totally agree with dangra. Also, this "hard part", will prevent the base Selector from clutter in the future. And this doesn't endanger the Selector, it can not just die aside on a separate project.

umrashrf commented 9 years ago

@dangra @kmike I'm interested to do it.

I think there are two steps to the PR.

Digenis commented 9 years ago

I don't like the idea of copy-pasting the selectors code to another repository. All the code history disappears. I'd even prefer forking the scrapy repository it self and start from there, instead of creating a partial copy of the current state of a subdirectoy.

umrashrf commented 9 years ago

@Digenis good point. I will test this with temporary selectors repo and see how it goes.

kmike commented 9 years ago

It may be possible to move just one folder without forking the repo; I haven't tried it, but something like http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/ may work.

Digenis commented 9 years ago

I do occasionally play with such things. Of course we will not end up with a "honest" history and it will take several filter-branch commands to get there.

I hope I don't make it seem complicated but I am suggesting that since we are rewriting history we should keep a meta-history of how we rewrote it.

I think the deliverables of this milestone should not be just the new repository but also a list of git commands or a small shell script that create it (reproducible result).

We should be able on an empty repo to add and fetch scrapy as a remote and after checking out some specific reference, run this list of commands to create the individual repo.

That way we can see the actions we take to strip-down scrapy to the selectors. Otherwise we are left only with the final result of each individual attempt to rewrite, with no way to frame the changes (we will not have diffs to read) and have something we can understand and comment on.

umrashrf commented 9 years ago

@Digenis I understand filter-branch will only bring in history to some extent. So the advantage of using your approach over filter-branch is that people get to see where did Selectors repo come from right? If I fork Scrapy and cut it short to Selectors then it answers that question.

We should be able on an empty repo to add and fetch scrapy as a remote and after checking out some specific reference, run this list of commands to create the individual repo.

Why is that required? Why would someone like to do that if there already exists Selectors repo?

Digenis commented 9 years ago

I probably got to much into details towards the end. I only argue about showing what you did to separate the selectors. Of course people will know where the repo comes from.

I wanted to say that once you fork and start rewriting history, you will be updating all the references and your work on it will not be visible as in a typical version control workflow. I expect this to be interesting enough to be exposed. eg: by publishing the commands you ran to achieve this.

dangra commented 9 years ago

I second @Digenis on this, I splitted several repositories in the past and it never was in a single pass so I found the best way is to write a script (bash) to rerun on each debug iteration.

Once you wrote the script it is simple to make it public.

umrashrf commented 9 years ago

I can create a script. First I will see how it pans out with my commits over Scrapy's history. It's just now that I realized what's actually the hardest part :)

umrashrf commented 9 years ago

Just to update you guys, I have spend some time with git-filter-branch and git-subtree and it seems to me git-filter-branch is the way to go.

You can see a working script which extracts selector and related modules from scrapy into separate branch. https://gist.github.com/umrashrf/2f090797bdc25857325b

It's still work in progress as I have to apply my changes on it to make selectors reusable. I will soon update the PR.

kmike commented 9 years ago

Great, thanks Umair!

umrashrf commented 9 years ago

My pleasure @kmike

nyov commented 9 years ago

Sorry to find this at such a late stage. As to history filtering, in this instance I don't so much see the necessity to keep the history (duplicated) in a new repository after splitting Selector code. The old history is still right here! Put a link or reference in the new root commit to the scrapy commit it was filtered out of, maybe push a git-replace "graftpoint" as well, and be done with it.

Otherwise there is subtree filtering and manual tree grafting, but as noted before it wouldn't be an honest history, missing some files where code was moved to/from in the meantime or renamed.

umrashrf commented 9 years ago

@nyov I think they are okay with history provided by git-filter-branch. They need the script for historical reasons. The script will be outdated for sure but at any point in future they can see how was it brought into existence.

After writing script, I found that it's indeed better approach to split the repo as you don't need to keep tabs on everything in your head :)

Digenis commented 9 years ago

No, not historical.

Each line of code exists to do something specific and hopefully understandable but the trial and error process that made someone write spawns a longer story.

It often complements well, even "self-documenting" code. Because one day you will end up reading something counter-intuitive that can make you waste time figuring out "how" instead of looking in the history and stopping in the "why".

To explain it in one sentence: "git blame gives you what you couldn't find in the code comments"

umrashrf commented 9 years ago

So @Digenis are you happy with my current approach to split it? Can you leave your comments at https://github.com/scrapy/scrapy/pull/1007. I pushed script with patches so you can see https://github.com/umrashrf/scrapy/commit/dc6565d989a08756335dcbd514e369fd0a78c274.

nyov commented 9 years ago

Hmm, I don't think you got my point. The git blame history of the project is still right here. With a git-replace ref "symlink" or pointer, the whole history of the newly detached project can be recovered by someone merging both project repositories locally (read the git-replace link I posted earlier).

But if you generate a custom, filtered, history to include into the new project, you rewrote history, thereby invalidating git blame's usefulness. filter-branch doesn't trivially cover merges, meaningful commit information may get lost. Doing it "wrong", you'll also overwrite the real commit author and date information. (Ofc that's actually the right way, since you rewrote history.) So by creating a new history this way, you instead make it harder for someone to get the correct, canonical history (besides publishing a duplicated but different history).

I'm just saying this from my experience from rewriting and grafting a lot of project histories together (from svn imports and what not), I believe the filter-branch approach is the wrong one here instead of the git-replace approach. OTOH if everyone else agrees on filter-branch, understanding these points I raised, don't let me stop you from going down that road.

edit: It's also possible to do both approaches. Maybe that'd satisfy everyone: Publishing this rewritten history, and adding a git-replace ref at the time of the split, someone could override the generated history by pulling in the real one from the scrapy repo.

umrashrf commented 9 years ago

I think you make good points @nyov. I will play with git-replace over the weekend.

@Digenis @dangra @kmike what do you think about this approach?

Digenis commented 9 years ago

I think nyov's made a better suggestion. I said "history", at some point, then "filter-branch" and didn't think further. Indeed it would be more convenient if nyov saw this earlier.

Still I find your script interesting and something I would refer to next time I 'll be doing complicated filter-branch sequences.

umrashrf commented 9 years ago

Guys how this history looks like? :sunglasses: https://github.com/umrashrf/selectors/commits/selectors-20150329

FYI Added two custom references (ROOT for selectors and SCRAPY_HEAD for scrapy commit from where it was detached). There are instructions in root commit as how to bring back full history.

Edit 1: Here are the steps I took to create it https://gist.github.com/umrashrf/3538a4394561007a753d

nyov commented 9 years ago

Oooh, lot's of work there :) Nice and clean history.

If I may say, you could drop the whole rewriting and your current root commit. That's only taking up git-object space from now until forever, when you're deleting most of the files in the next commit anyway. After someone adds the git-replace ref, the root commit disappears from history, so it could be an empty commit -- I would take your 2nd commit instead, where you cleaned up (removed scrapy stuff but kept selectors stuff) as root.

And where you have git replace ROOT SCRAPY_HEAD in the commit message, we need actual SHA1 refs, where ROOT is the new root commit (obviously) and SCRAPY_HEAD must be the point in time where you took the code from scrapy. (Ideally that would be last commit before selectors get removed from the scrapy repo.) Just pointing people at scrapy's HEAD won't work, as master will keep changing after the split, and they'll have to guess at what point in time those histories matched.

Maybe write something fancy for the new root commit:

Selectors are dead, long live Selectors

This project been split out of scrapy
(https://github.com/scrapy/scrapy).

To resolve the full history in your local copy,
first fetch the remaining history:

    git remote add scrapy git@github.com:scrapy/scrapy.git
    git fetch scrapy

Then add this git-replace ref:

    git replace dafd9f 0c7482

And thanks for all the work.

umrashrf commented 9 years ago

If I may say, you could drop the whole rewriting and your current root commit. That's only taking up git-object space from now until forever, when you're deleting most of the files in the next commit anyway. After someone adds the git-replace ref, the root commit disappears from history, so it could be an empty commit -- I would take your 2nd commit instead, where you cleaned up (removed scrapy stuff but kept selectors stuff) as root.

Are you suggesting something like this?

mkdir selectors
pushd selectors 
git init
git commit --allow-emtpy -m "born out of scrapy/scrapy"
popd
cp -r scrapy/* selectors/

And where you have git replace ROOT SCRAPY_HEAD in the commit message, we need actual SHA1 refs, where ROOT is the new root commit (obviously) and SCRAPY_HEAD must be the point in time where you took the code from scrapy. (Ideally that would be last commit before selectors get removed from the scrapy repo.) Just pointing people at scrapy's HEAD won't work, as master will keep changing after the split, and they'll have to guess at what point in time those histories matched.

You're right about SCRAPY_HEAD as it's confusing. I also thought of SCRAPY_TIP but it doesn't sound good also. I am out of ideas for the name :( I had actual SHA there before but I think it is not good for people who don't know about git-replace and naming them gives a hint before looking at the commit itself.

Thanks for the bad-ass commit message!

nyov commented 9 years ago

@umrashrf, I lost you there on IRC.

This is how I thought it could look like: https://github.com/nyov/selectors/commits/master

umrashrf commented 9 years ago

thanks @nyov, it makes sense. My concern was if git-replace will handle the moves like scrapy/selector/ -> selectors/ and scrapy/utils -> selectors/utils without having parent commit. I need to test this and will update the repo shortly.

nyov commented 9 years ago

Well best don't move the files around in the first two commits. That makes it cleaner in the history. But git itself doesn't care, it will show the files as "moved" if they look similar by a certain percentage. Otherwise it will show them as "removed" and "added". It's not preserved in git objects if a file was moved, git doesn't care where the parent commit came from.

umrashrf commented 9 years ago

I just realized it must not be a problem because if git-diff can do it so can git-replace. If it was the other way around ("removed" and "added") then git-blame would be affected.

umrashrf commented 9 years ago

@nyov please review it when you can https://github.com/umrashrf/selectors/commits/selectors

nyov commented 9 years ago

This looks good to me!

Now looking at the commits, your new README.md should probably be an .rst file, scrapy projects aren't using markdown anywhere else that I know of.

And just maybe you can squash some of these commits,

0e5f9d2 added selectors specific python package files
46e7c32 added selectors readme
e00e6a0 removed unnecessary lines from .gitignore

could probably all go together. Though I don't want to be nit-picking here.

umrashrf commented 9 years ago

Now looking at the commits, your new README.md should probably be an .rst file, scrapy projects aren't using markdown anywhere else that I know of.

Makes sense. I'd migrate to markdown later.

And just maybe you can squash some of these commits,

I also think atomic commits are better. Can you suggest a commit message?

nyov commented 9 years ago

Can you suggest a commit message?

Something like "set up Selectors project environment" maybe.

umrashrf commented 9 years ago

@nyov done those changes. Now only docs are pending.

kmike commented 9 years ago

Sorry, I don't understand all this git voodoo :) But in https://github.com/umrashrf/selectors/tree/master it is possible to check the history of each file using github interface - it shows all relevant commits, "blame" works. In "selectors" branch it doesn't work.

kmike commented 9 years ago

Why should we require users to "set up Selectors project environment"?

nyov commented 9 years ago

wait, what? oh no no no, "set up Selectors project environment" was meant as a commit message, not some user instruction. Could also be "shuffling some files around and adding a new README".

kmike commented 9 years ago

Sorry, I was unclear: why should we require developers who want to use git blame or git history to set up Selectors project environment?

nyov commented 9 years ago

@kmike, I don't really like to write this again, it already feels like I was pushing for this choice, which wasn't my intent. I wasn't thinking about github's blame not working with this choice.

My reason for suggesting git replace is that I think it keeps the history here clean and honest. If we include a copy of a rewritten history here, it's a partial duplicate to scrapy's and potentially not quite right (from dropping files in the rewrite).

In summary, if scrapy.git would be rewritten afterwards to remove selectors history, moving it here would feel right, to me. Since it isn't and scrapy.git is sticking around for the foreseeable future, where you can look it up, duplicating some mangled history (other sha1 sums, commit messages not quite fitting since files are missing from commits) felt not quite right, to me.

That is my opinion, which I hoped to explain, nothing else. Feel free to disagree and whatever works best for everyone else is good enough for me.

umrashrf commented 9 years ago

Sorry, I don't understand all this git voodoo :) But in https://github.com/umrashrf/selectors/tree/master it is possible to check the history of each file using github interface - it shows all relevant commits, "blame" works. In "selectors" branch it doesn't work.

@kmike what about creating a separate branch at Selectors repo with full history? I mean the result of git-replace. Call that separate branch history or anything.

kmike commented 9 years ago

My reason for suggesting git replace is that I think it keeps the history here clean and honest. If we include a copy of a rewritten history here, it's a partial duplicate to scrapy's and potentially not quite right (from dropping files in the rewrite).

In summary, if scrapy.git would be rewritten afterwards to remove selectors history, moving it here would feel right, to me. Since it isn't and scrapy.git is sticking around for the foreseeable future, where you can look it up, duplicating some mangled history (other sha1 sums, commit messages not quite fitting since files are missing from commits) felt not quite right, to me.

What practical problems are caused by unclean dishonest duplicated history, changed sha1 sums and not-quite-fitting commit messages? Unfitting commit messages may cause some confusion, but IMHO they are better than nothing. Having full history can save some time on tracking the code changes, allow people to use github UI and local git without any setup, it also preserves contribution history.

@kmike what about creating a separate branch at Selectors repo with full history? I mean the result of git-replace. Call that separate branch history or anything.

I think that a separate branch won't help much once we start making more changes to selectors; one can always use Scrapy repository to see old changes.

umrashrf commented 9 years ago

I think that a separate branch won't help much once we start making more changes to selectors; one can always use Scrapy repository to see old changes.

Yes we can write a script which can update that separate branch. The advantage is it gives full selectors history in github UI.

kmike commented 9 years ago

Yes we can write a script which can update that separate branch. The advantage is it gives full selectors history in github UI.

This adds complexity and thus makes maintenance harder. It is better to agree on one of the options, be it "drop history before relocation, but provide instructions on how to get it back" or "provide pseudo-history".

Sorry, I still don't get what is clean about dropping the history and why do you like it :) As @Digenis said, "git blame gives you what you couldn't find in the code comments", and pseudo-history gives you a meaningful git blame - it is a practical advantage. There must be something, because it seems so far there is +2 or +3 ( @umrashrf, @nyov and @Digenis?) for dropping pseudo-history vs +1 (me) for providing it.

umrashrf commented 9 years ago

Sorry, I still don't get what is clean about dropping the history and why do you like it :)

I don't like dropping the history and neither do they. git-filter-branch has a good chance to break git-blame usefulness by giving you partial history. I also wondered how and talked this through with @nyov on IRC so I'm just going to paste them here. I was going to paste it before but sorry I got lost in other stuff.

nyov: well, assume at some point parts of the file were moved there from a different file. if you run filter-branch, and removing that other file, the source is lost
nyov: or if the file was renamed in the past, but you keep only one filename through the rewrite.
nyov: or, well, look at the history here: https://github.com/umrashrf/selectors/commits/master
nyov: all the commits now have you as the Committer, meaning the historical commit date is lost - not much use for git blame, right?
nyov: it's not so bad if the original Author was the commiter, too. but if they were different that information is lost
nyov: anyway. I see your reason for wanting to keep the history there. I'm very much for not losing history, myself.
nyov: It would make more sense if you rewrite scrapy.git at the same time, removing all references to selectors, ergo splitting repositories. But since this is a duplicate history, I just felt it more a waste of space to keep a modified history as well

So based on these points what do you think @kmike?

nyov commented 9 years ago

ah, that stupid irc log. :p

@kmike, I disagree on "it also preserves contribution history", at least commit author and dates are overwritten. But count me as +/-0. Both approaches have dis/advantages, I'm not sure which to favor anymore in this case.

dangra commented 9 years ago

Hi guys, let's draw a hard line stating clear we are not rewriting Scrapy history by any chance :)

As for selectors repository history, I +1 to provide pseudo history and make it clear that the project was part of Scrapy and the exact history can only be found there.

And please let's move on with the rest of the work to make Selectors and standalone package

nyov commented 9 years ago

(AN: the same decision is now open for the new scrapy-djangoitems project)

dangra commented 9 years ago

IMO the pseudo history should retain commit dates and authors