Order is assumed to be fixed

aubertc commented 7 months ago

Consider the following sheet:

the output is

Welcome to Abstract/DOI Finder.
You did not provide a file to process, I will look in the input/ folder for a spreadsheet.
I found exactly one spreadsheet in the input folder:
        /donnees/travail/git/abstract_doi_finder/abstract_doi_finder/input/test_input.xlsx
And will process that file.
An abstract column has been inserted for each sheet with a title column with no abstract column already existing.
A DOI column has been inserted for each sheet with a title column with no DOI column already existing.
An abstract column has been inserted for each sheet with a title column with no abstract column already existing.
A DOI column has been inserted for each sheet with a title column with no DOI column already existing.
The program is currently working on the sheet: R1T.csv
Number of publications that had an abstract for the sheet with the author 'Aubert, Clément'' on PubMed: 1/1
Number of publications that had a DOI for the sheet with the author 'Aubert, Clément' on PubMed: 1/1
The program is currently working on the sheet: TR1.csv
Number of publications that had an abstract for the sheet with the author 'Data Integration for the Study of Outstanding Productivity in Biomedical Research'' on PubMed: 0/0
Number of publications that had a DOI for the sheet with the author 'Data Integration for the Study of Outstanding Productivity in Biomedical Research' on PubMed: 0/0
Thanks for coming! Your abstracts and DOIs should be in your Excel file now

As we can see, starting with the second sheet, the program takes the title to be the author. My guess is that the program assumes the structure of the cells to be fixed, that is, if the author was in column X in the first sheet, then it will always be in column X in the following sheet(s).

aubertc commented 7 months ago

After some testing, my hypothesis does not hold. Perhaps the program simply assumes that the title comes after the author? I am really not sure what is happening.

The problem is seen through the

with the author 'Data Integration for the Study of Outstanding Productivity in Biomedical Research'

in the output above.

KingAdam2004 commented 7 months ago

For this problem, I am not sure why it is skipping the first sheet within the test_input, but I do know why the output is the way it is. The program does not assume the title column comes after the author, as it has to specifically search for the title column for each sheet.

The output comes from whatever is in the author/researcher column. However, I cannot make sense of why it chose 'Data Integration for the Study of Outstanding Productivity in Biomedical Research'. The way I remember it, the program chooses the first element in the data gathered from each Excel sheet.

KingAdam2004 commented 7 months ago

The output comes from whatever is in the author/researcher column. However, I cannot make sense of why it chose 'Data Integration for the Study of Outstanding Productivity in Biomedical Research'. The way I remember it, the program chooses the first element in the data gathered from each Excel sheet.

I realized that this is because the author column usually came first within the sheets we experimented on, so we assumed the author column always came before any title column.

This can probably be fixed easily by tracking the index of the author column and the index of the title column separately, so that this mistake is not made. I overlooked this possibility because of the simplicity of the tests we made.
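
A minimal sketch of that idea, not the project's actual code: each sheet's header row is searched for the author column and for the title column independently, so neither lookup depends on which one comes first. The names `find_column`, `locate_columns`, `AUTHOR_HEADERS` and `TITLE_HEADERS`, as well as the accepted header spellings, are assumptions made for illustration, and sheets are represented as plain lists of rows rather than real spreadsheet objects.

```python
from typing import Collection, Optional, Sequence, Tuple

# Hypothetical header spellings; the real program may accept different ones.
AUTHOR_HEADERS = frozenset({"author", "authors", "researcher"})
TITLE_HEADERS = frozenset({"title"})


def find_column(header_row: Sequence[object], accepted: Collection[str]) -> Optional[int]:
    """Return the index of the first header cell whose text is in `accepted`, or None."""
    for index, cell in enumerate(header_row):
        if cell is not None and str(cell).strip().lower() in accepted:
            return index
    return None


def locate_columns(header_row: Sequence[object]) -> Tuple[Optional[int], Optional[int]]:
    """Look up the author and title columns separately, ignoring their relative order."""
    return find_column(header_row, AUTHOR_HEADERS), find_column(header_row, TITLE_HEADERS)


# Two made-up sheets with the columns in opposite orders.
sheet_author_first = [["Researcher", "Title"], ["Aubert, Clément", "Some title"]]
sheet_title_first = [["Title", "Researcher"], ["Some title", "Aubert, Clément"]]

for rows in (sheet_author_first, sheet_title_first):
    author_idx, title_idx = locate_columns(rows[0])
    if author_idx is None or title_idx is None:
        continue  # A sheet without both columns would be skipped in this sketch.
    print(rows[1][author_idx], "|", rows[1][title_idx])
```

With both orderings the author cell is read from the author column and the title cell from the title column, because the two indices are recomputed for every sheet instead of being inferred from position.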

aubertc commented 6 months ago

Ok, thanks. Can you try to implement that?

KingAdam2004 commented 6 months ago

This commit (https://github.com/popbr/abstract_doi_finder/commit/d9fa1be60e19993c4eac0917cd543ae1ef43f66d) should fix the issue we were having. Additionally, it should fix some of the other issues that we were having.

Please take a look when you have the time.

popbr / abstract_doi_finder

Order is assumed to be fixed #17