juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

Duplicating columns in reconciled #77

Open PmasonFF opened 1 year ago

PmasonFF commented 1 year ago

Running with -f csv on the following flattened file and reconcile parameters: flatten_sample_label_transcription_sorted.csv, reconcile_parameters_sample.txt

The column Taxon_label ends up duplicated in the reconciled output, and the summary fails to build with this error:

C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py:231: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
  ids = flag_df[flag_df[col].apply(
Traceback (most recent call last):
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 236, in <module>
    main()
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 229, in main
    summary.report(args, unreconciled, reconciled)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 39, in report
    filters = get_filters(args, flag_df)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 231, in get_filters
    ids = flag_df[flag_df[col].apply(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\frame.py", line 3791, in __getitem__
    return self.where(key)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\frame.py", line 11910, in where
    return super().where(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\generic.py", line 9968, in where
    return self._where(cond, other, inplace, axis, level)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\generic.py", line 9663, in _where
    cond = cond.reindex(self._info_axis, axis=self._info_axis_number, copy=False)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\util\_decorators.py", line 347, in wrapper
    return func(*args, **kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\frame.py", line 5194, in reindex
    return super().reindex(**kwargs)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\generic.py", line 5289, in reindex
    return self._reindex_axes(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\frame.py", line 4987, in _reindex_axes
    frame = frame._reindex_columns(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\frame.py", line 5032, in _reindex_columns
    return self._reindex_with_indexers(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\generic.py", line 5355, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\internals\managers.py", line 729, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\venv\lib\site-packages\pandas\core\indexes\base.py", line 4359, in _validate_can_reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels

If I rename the next column, Taxon, to a name that does not contain the string "Taxon", everything runs fine. The duplication occurs when the flattened file has both a Taxon_label column and a Taxon column.
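One way this pattern could arise, offered purely as a hypothesis and not taken from the project's code, is a column lookup that matches by substring instead of by exact name. "Taxon" is a substring of "Taxon_label", so a substring test picks Taxon_label twice. The helper names below (select_by_substring, select_exact) are hypothetical and exist only for this illustration.

```python
# Hypothetical illustration; these helpers are NOT from label_reconciliations.
headers = ["Collector", "Taxon_label", "Taxon", "Determiner"]
wanted = ["Collector", "Taxon_label", "Taxon", "Determiner"]


def select_by_substring(headers, wanted):
    """Pick every header that contains a wanted name (can over-match)."""
    picked = []
    for name in wanted:
        picked.extend(h for h in headers if name in h)
    return picked


def select_exact(headers, wanted):
    """Pick only headers whose name equals a wanted name."""
    header_set = set(headers)
    return [name for name in wanted if name in header_set]


loose = select_by_substring(headers, wanted)
strict = select_exact(headers, wanted)
print(loose)   # Taxon_label appears twice because "Taxon" is contained in it
print(strict)  # each column appears once
```

If the real code does something similar, switching to an exact comparison would avoid the duplicate, but that would need to be checked against the actual source.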

The reconciled table does NOT appear to have any duplicates in its header, i.e., printing reconciled.headers.keys() gives:

dict_keys(['subject_id', 'classification_id', 'user_name', 'NHMD', 'Scientific_Name', 'workflow_id', 'workflow_version', 'created_at', 'Fully_visible', 'Accession', 'Belonged', 'Type_label', 'Location', 'month', 'day', 'year', 'Date', 'Collector', 'Taxon_label', 'Taxon', 'Determiner', 'Sex', 'Genital_prep', 'Prep_number', 'Prepared_by', 'ZMUC_number'])

However, printing reconciled.to_df(args).columns gives:

Index(['subject_id', 'NHMD', 'Scientific_Name', 'Fully_visible', 'Accession',
       'Belonged', 'Type_label', 'Location', 'month', 'day', 'year',
       'Collector', 'Taxon_label', 'Taxon_label', 'Taxon', 'Determiner', 'Sex',
       'Genital_prep', 'Prep_number', 'Prepared_by', 'ZMUC_number'],
      dtype='object')

which has the duplication in it.
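For anyone reproducing this, a small standard-library check can confirm which labels are repeated. This is only a sketch: the list is copied from the columns output above, and duplicated_labels is a hypothetical helper name.

```python
from collections import Counter

# Column labels copied from the reconciled.to_df(args).columns output above.
columns = [
    "subject_id", "NHMD", "Scientific_Name", "Fully_visible", "Accession",
    "Belonged", "Type_label", "Location", "month", "day", "year",
    "Collector", "Taxon_label", "Taxon_label", "Taxon", "Determiner", "Sex",
    "Genital_prep", "Prep_number", "Prepared_by", "ZMUC_number",
]


def duplicated_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]


dupes = duplicated_labels(columns)
print(dupes)
```

With a pandas Index, the same question can usually be answered with its duplicated() method, but the version above needs no third-party packages.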

Printing the args at that point gives:

Namespace(input_file='C:\\py_scripts\\Scripts_reconcile_editor\\Demo\\flatten_sample_label_transcription_sorted.csv', unreconciled=None, reconciled='C:\\py_scripts\\Scripts_reconcile_editor\\Demo\\reconciled_sample_label_transcription.csv', summary='C:\\py_scripts\\Scripts_reconcile_editor\\Demo\\summary_sample_label_transcription.html', explanations=True, zip=None, workflow_name=None, workflow_id=None, fuzzy_ratio_threshold=90, fuzzy_set_threshold=50, workflow_csv='', format='csv', column_types=['NHMD:select,Scientific_Name:select,Fully_visible:select,Accession:select,Belonged:text,Type_label:select,Location:text,month:select,day:select,year:select,Collector:text,Taxon_label:select,Taxon:text,Determiner:text,Sex:select,Genital_prep:select,Prep_number:text,Prepared_by:text,ZMUC_number:text'], group_by='subject_id', page_size=20, no_summary_detail=False, row_key='classification_id', user_column='user_name', max_transcriptions=50)

These look correct, so I am baffled...
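As a sanity check on the column_types value printed above, the following sketch splits it into name:type pairs and verifies that no name repeats. The parsing rule (comma-separated name:type items) is an assumption inferred from the printed value, and parse_column_types is a hypothetical helper, not the project's own parser.

```python
# Value copied from the Namespace output above.
column_types_arg = [
    "NHMD:select,Scientific_Name:select,Fully_visible:select,"
    "Accession:select,Belonged:text,Type_label:select,Location:text,"
    "month:select,day:select,year:select,Collector:text,"
    "Taxon_label:select,Taxon:text,Determiner:text,Sex:select,"
    "Genital_prep:select,Prep_number:text,Prepared_by:text,ZMUC_number:text"
]


def parse_column_types(values):
    """Split 'name:type,name:type' strings into (name, type) pairs."""
    pairs = []
    for value in values:
        for item in value.split(","):
            name, _, kind = item.partition(":")
            pairs.append((name.strip(), kind.strip()))
    return pairs


pairs = parse_column_types(column_types_arg)
names = [name for name, _ in pairs]
has_repeats = len(names) != len(set(names))
print(len(pairs), has_repeats)
```

No name repeats here, which is consistent with the arguments being fine and the duplicate being introduced later, somewhere between the headers and to_df.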