thoth-pub / thoth-loader

Metadata import from CSV into Thoth
Apache License 2.0
3 stars 1 forks source link

Feature/scielo chapters #10

Closed brendan-oconnell closed 3 weeks ago

brendan-oconnell commented 5 months ago

Book and Chapter loader for bulk ingest of chapter metadata from SciELO Books JSON. Chapter loader and previously completed book loader have been combined into one file, scieloloader.py, with significant refactoring of the original book loader to reuse methods between the two classes whenever possible, given the somewhat similar structure of the chapter and book JSON metadata.

This loader has been tested with book and chapter metadata for EDITUS and EDUEPB. However, errors in the JSON for other publishers may produce errors that will need to be addressed in subsequent iterations of the loader.

brendan-oconnell commented 4 months ago

It's best to avoid the complexity and potential conflicts of using multiple inheritance.

At a minimum you should remove redundant inheritance:

class ScieloLoader(BookLoader):
class ScieloBookLoader(ScieloLoader):
class ScieloChapterLoader(ScieloLoader, ChapterLoader):

But you can also remove the multiple inheritance in ScieloChapterLoader completely by instantiating ChapterLoader separaretly, e.g.

class ScieloChapterLoader(ScieloLoader):
    chapter_loader = ChapterLoader()

    def run(self):
        # ....
        book_id = self.chapter_loader.get_book_by_title(book_title).workId
        # ...

I removed the multiple inheritance for SciELOLoader and SciELOBookLoader, but when I tried to instantiate ChapterLoader separately with chapter_loader = ChapterLoader(), it threw an error because ChapterLoader needs metadata_file, client_url, email, password for init and I'm not sure how I would pass those values inside the class.

So I've left it for the moment with multiple inheritance i.e. class SciELOChapterLoader(SciELOLoader, ChapterLoader):, however another option would be, since I'm only using one method from ChapterLoader, get_book_by_title, I could just dupljcate this method in SciELOChapterLoader and not have to inherit from ChapterLoader at all. What do you think?

pep8speaks commented 4 months ago

Hello @brendan-oconnell! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 45:121: E501 line too long (123 > 120 characters) Line 135:121: E501 line too long (136 > 120 characters)

Comment last updated at 2024-06-13 08:04:11 UTC
ja573 commented 4 months ago

It's best to avoid the complexity and potential conflicts of using multiple inheritance. At a minimum you should remove redundant inheritance:

class ScieloLoader(BookLoader):
class ScieloBookLoader(ScieloLoader):
class ScieloChapterLoader(ScieloLoader, ChapterLoader):

But you can also remove the multiple inheritance in ScieloChapterLoader completely by instantiating ChapterLoader separaretly, e.g.

class ScieloChapterLoader(ScieloLoader):
    chapter_loader = ChapterLoader()

    def run(self):
        # ....
        book_id = self.chapter_loader.get_book_by_title(book_title).workId
        # ...

I removed the multiple inheritance for SciELOLoader and SciELOBookLoader, but when I tried to instantiate ChapterLoader separately with chapter_loader = ChapterLoader(), it threw an error because ChapterLoader needs metadata_file, client_url, email, password for init and I'm not sure how I would pass those values inside the class.

So I've left it for the moment with multiple inheritance i.e. class SciELOChapterLoader(SciELOLoader, ChapterLoader):, however another option would be, since I'm only using one method from ChapterLoader, get_book_by_title, I could just dupljcate this method in SciELOChapterLoader and not have to inherit from ChapterLoader at all. What do you think?

Yes, let's do that. (NB. from ChapterLoader you're using methods get_book_by_title and create_chapter_relation)

brendan-oconnell commented 4 months ago

@ja573 I see you added a linter, that's helpful, thanks. Should I try to address all of the issues the linter found? Specifically, I got a lot of line too long (123 > 79 characters), I could of course refactor to address all of these but it would take some time...

ja573 commented 4 months ago

@ja573 I see you added a linter, that's helpful, thanks. Should I try to address all of the issues the linter found? Specifically, I got a lot of line too long (123 > 79 characters), I could of course refactor to address all of these but it would take some time...

Even though pep8 wants 80 characters per line, it's more or less accepted nowadays to have 120. I need to include a config file in the repo to override this rule, so just address the ones where lines are longer than 120

brendan-oconnell commented 4 months ago

It's best to avoid the complexity and potential conflicts of using multiple inheritance. At a minimum you should remove redundant inheritance:

class ScieloLoader(BookLoader):
class ScieloBookLoader(ScieloLoader):
class ScieloChapterLoader(ScieloLoader, ChapterLoader):

But you can also remove the multiple inheritance in ScieloChapterLoader completely by instantiating ChapterLoader separaretly, e.g.

class ScieloChapterLoader(ScieloLoader):
    chapter_loader = ChapterLoader()

    def run(self):
        # ....
        book_id = self.chapter_loader.get_book_by_title(book_title).workId
        # ...

I removed the multiple inheritance for SciELOLoader and SciELOBookLoader, but when I tried to instantiate ChapterLoader separately with chapter_loader = ChapterLoader(), it threw an error because ChapterLoader needs metadata_file, client_url, email, password for init and I'm not sure how I would pass those values inside the class. So I've left it for the moment with multiple inheritance i.e. class SciELOChapterLoader(SciELOLoader, ChapterLoader):, however another option would be, since I'm only using one method from ChapterLoader, get_book_by_title, I could just dupljcate this method in SciELOChapterLoader and not have to inherit from ChapterLoader at all. What do you think?

Yes, let's do that. (NB. from ChapterLoader you're using methods get_book_by_title and create_chapter_relation)

Sorry, just to clarify - do that meaning keep the multiple inheritance from SciELOLoader and ChapterLoader in SciELOChapterLoader, or duplicate the methods get_book_by_title and create_chapter_relation in SciELOChapterLoader?

ja573 commented 4 months ago

It's best to avoid the complexity and potential conflicts of using multiple inheritance. At a minimum you should remove redundant inheritance:

class ScieloLoader(BookLoader):
class ScieloBookLoader(ScieloLoader):
class ScieloChapterLoader(ScieloLoader, ChapterLoader):

But you can also remove the multiple inheritance in ScieloChapterLoader completely by instantiating ChapterLoader separaretly, e.g.

class ScieloChapterLoader(ScieloLoader):
    chapter_loader = ChapterLoader()

    def run(self):
        # ....
        book_id = self.chapter_loader.get_book_by_title(book_title).workId
        # ...

I removed the multiple inheritance for SciELOLoader and SciELOBookLoader, but when I tried to instantiate ChapterLoader separately with chapter_loader = ChapterLoader(), it threw an error because ChapterLoader needs metadata_file, client_url, email, password for init and I'm not sure how I would pass those values inside the class. So I've left it for the moment with multiple inheritance i.e. class SciELOChapterLoader(SciELOLoader, ChapterLoader):, however another option would be, since I'm only using one method from ChapterLoader, get_book_by_title, I could just dupljcate this method in SciELOChapterLoader and not have to inherit from ChapterLoader at all. What do you think?

Yes, let's do that. (NB. from ChapterLoader you're using methods get_book_by_title and create_chapter_relation)

Sorry, just to clarify - do that meaning keep the multiple inheritance from SciELOLoader and ChapterLoader in SciELOChapterLoader, or duplicate the methods get_book_by_title and create_chapter_relation in SciELOChapterLoader?

Duplicate the methods, but in BookLoader

ja573 commented 4 months ago

@ja573 I see you added a linter, that's helpful, thanks. Should I try to address all of the issues the linter found? Specifically, I got a lot of line too long (123 > 79 characters), I could of course refactor to address all of these but it would take some time...

Even though pep8 wants 80 characters per line, it's more or less accepted nowadays to have 120. I need to include a config file in the repo to override this rule, so just address the ones where lines are longer than 120

You should get the config file now if you merge develop onto your branch (or cherry-pick fc37568)

brendan-oconnell commented 4 months ago

Hello @brendan-oconnell! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 6:1: E302 expected 2 blank lines, found 1

Line 6:1: E302 expected 2 blank lines, found 1 Line 10:1: W391 blank line at end of file

Line 6:1: E302 expected 2 blank lines, found 1

Comment last updated at 2024-02-12 14:19:35 UTC

Hello @brendan-oconnell! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 6:1: E302 expected 2 blank lines, found 1

Line 6:1: E302 expected 2 blank lines, found 1 Line 10:1: W391 blank line at end of file

Line 6:1: E302 expected 2 blank lines, found 1

Comment last updated at 2024-02-12 14:19:35 UTC

@ja573 this is out of date, I've fixed all these linter errors in the latest push but I guess it hasn't caught up yet for some reason...