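Error when running the α stock-selection strategy

Hello! I get the following error when running the α stock-selection strategy. How should I fix it?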

ValueError                                Traceback (most recent call last)
Cell In[3], line 6
      4 # Get total market value, total liabilities, total cash, and EBITDA for all stocks
      5 print(shares[0:10])
----> 6 dt = qt.get_history_data(htypes, shares=shares, asset_type='any', freq='q')
      7 # Pick an arbitrary stock and convert it to a DataFrame to check that the data was fetched correctly
      8 one_share = shares[24]

File D:\Python\Lib\site-packages\qteasy\core.py:1050, in get_history_data(htypes, shares, symbols, start, end, freq, rows, asset_type, adj, as_data_frame, group_by, **kwargs)
   1048 elif group_by in ['htypes', 'htype', 'h']:
   1049     group_by = 'htypes'
-> 1050 hp = get_history_panel(htypes=htypes, shares=shares, start=start, end=end, freq=freq, asset_type=asset_type,
   1051                        symbols=symbols, rows=rows, adj=adj, **kwargs)
   1053 if as_data_frame:
   1054     return hp.unstack(by=group_by)

File D:\Python\Lib\site-packages\qteasy\history.py:2426, in get_history_panel(htypes, shares, symbols, freq, start, end, rows, asset_type, adj, data_source, drop_nan, resample_method, b_days_only, trade_time_only, **kwargs)
   2424     pure_ref_htypes = [itm[0] for itm in htype_splits if len(itm) == 1]
   2425 # Get regular history data types, such as price/volume data and indicator data
-> 2426 normal_dfs = ds.get_history_data(
   2427         shares=shares,
   2428         htypes=normal_htypes,
   2429         start=start,
   2430         end=end,
   2431         freq=freq,
   2432         row_count=rows,
   2433         asset_type=asset_type,
   2434         adj=adj
   2435 ) if normal_htypes else {}
   2436 # Get index constituent weight data
   2437 weight_dfs = ds.get_index_weights(
   2438         index=weight_indices,
   2439         start=start,
   2440         end=end,
   2441         shares=shares
   2442 ) if weight_indices else {}

File D:\Python\Lib\site-packages\qteasy\database.py:4676, in DataSource.get_history_data(self, shares, symbols, htypes, freq, start, end, row_count, asset_type, adj)
   4674 if not df.empty:
   4675     htyp_series = df[htyp]
-> 4676     new_df = htyp_series.unstack(level=0)
   4677     old_df = df_by_htypes[htyp]
   4678     # Two ways to merge the dfs: merge() and join()
   4679     # df_by_htypes[htyp] = old_df.merge(new_df,
   4680     #                                   how='outer',
   4681     #                                   left_index=True,
   4682     #                                   right_index=True,
   4683     #                                   suffixes=('', '_y'))

File D:\Python\Lib\site-packages\pandas\core\series.py:4455, in Series.unstack(self, level, fill_value, sort)
   4410 """
   4411 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4412 
   (...)
   4451 b    2    4
   4452 """
   4453 from pandas.core.reshape.reshape import unstack
-> 4455 return unstack(self, level, fill_value, sort)

File D:\Python\Lib\site-packages\pandas\core\reshape\reshape.py:517, in unstack(obj, level, fill_value, sort)
    515 if is_1d_only_ea_dtype(obj.dtype):
    516     return _unstack_extension_series(obj, level, fill_value, sort=sort)
--> 517 unstacker = _Unstacker(
    518     obj.index, level=level, constructor=obj._constructor_expanddim, sort=sort
    519 )
    520 return unstacker.get_result(
    521     obj._values, value_columns=None, fill_value=fill_value
    522 )

File D:\Python\Lib\site-packages\pandas\core\reshape\reshape.py:154, in _Unstacker.__init__(self, index, level, constructor, sort)
    146 if num_cells > np.iinfo(np.int32).max:
    147     warnings.warn(
    148         f"The following operation may generate {num_cells} cells "
    149         f"in the resulting pandas object.",
    150         PerformanceWarning,
    151         stacklevel=find_stack_level(),
    152     )
--> 154 self._make_selectors()

File D:\Python\Lib\site-packages\pandas\core\reshape\reshape.py:210, in _Unstacker._make_selectors(self)
    207 mask.put(selector, True)
    209 if mask.sum() < len(self.index):
--> 210     raise ValueError("Index contains duplicate entries, cannot reshape")
    212 self.group_index = comp_index
    213 self.mask = mask

ValueError: Index contains duplicate entries, cannot reshape

**Environment**

OS: [windows10]
Versions [python:3.11.6 and qteasy:1.2.11]
Environment [numpy:1.26.2; pandas:2.1.2]
DataSource [type of datasource:csv and overview with qt.get_table_overview()]

Following tables contain local data, to view complete list, print returned DataFrame
                Has_data Size_on_disk Record_count Record_start Record_end
table                                                                     
trade_calendar    True        2.0MB         75K      19901012    20241231 
stock_basic       True        855KB          5K          None        None 
stock_company     True       11.4MB          6K          None        None 
index_basic       True        3.6MB         11K          None        None 
fund_basic        True        4.2MB         17K          None        None 
future_basic      True        1.7MB          9K          None        None 
opt_basic         True        8.5MB         53K          None        None 
stock_daily       True      781.8MB       10.2M      20110104    20231229 
index_daily       True        675KB          6K      20110104    20231229 
stock_indicator   True       1.39GB       10.3M      20110104    20231229 
balance           True      150.0MB        233K      20011231    20220930 
cashflow          True      116.1MB        214K      20061231    20220930 
financial         True      445.6MB        368K      20110331    20221231

P.S. Running qt.get_table_overview() raised UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 195: illegal multibyte sequence. I worked around it by changing line 2930 of database.py from with open(file_path_name, 'r') as fp: to with open(file_path_name, 'r',encoding='utf-8') as fp:
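For context on why the explicit encoding helps: when open() is called without an encoding argument, Python falls back to a locale-dependent default, which the error message suggests was gbk on this machine. The sketch below is only an illustration with a hypothetical sample file and made-up content, not qteasy's actual data files. It writes UTF-8 text and reads it back both ways.

```python
import os
import tempfile

# Hypothetical sample content; not taken from qteasy's data files.
text = "ts_code,name\n000001.SZ,平安银行\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)

    # Reading with the encoding the file was written in round-trips cleanly.
    with open(path, "r", encoding="utf-8") as fp:
        utf8_text = fp.read()

    # Reading the same bytes as GBK either raises UnicodeDecodeError or
    # silently yields garbled text, depending on the byte sequence.
    try:
        with open(path, "r", encoding="gbk") as fp:
            gbk_text = fp.read()
    except UnicodeDecodeError as exc:
        gbk_text = None
        print(f"gbk read failed: {exc}")

print(utf8_text == text)
```

Passing encoding explicitly removes the dependence on the machine's locale, which is why the one-line change to database.py made the error go away.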

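Thanks for the report! At first glance this looks like a data storage problem. I'll look into it closely, and I may ask some questions and request more information to help reproduce the bug. First, I noticed that you added encoding='utf-8' in database.py. Was that only to fix the get_table_overview() error, or do other functions raise the same error? If so, which ones?

I tried to reproduce the bug and believe it is related to the data stored in the data source: if the stored data contains duplicate index (primary key) values, you get the error you ran into. See the example below: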

# A DataFrame with no duplicate index entries can be unstacked:
>>> df
     c  d
a b      
1 1  1  6
  2  2  7
  3  3  8
2 1  4  9
  2  5  0
>>> df.unstack()
     c              d          
b    1    2    3    1    2    3
a                              
1  1.0  2.0  3.0  6.0  7.0  8.0
2  4.0  5.0  NaN  9.0  0.0  NaN

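# But if df contains duplicate primary key values (index entries), unstack() raises an error:
>>> df2
     c  d
a b      
1 1  1  9
  2  2  8
  3  3  7  # note that this row's index is (1, 3)
2 1  4  6
  2  5  5
1 3  6  4  # note that this row's index is also (1, 3)

# The error follows: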
>>> df2.unstack()
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py", line 9928, in unstack
    result = unstack(self, level, fill_value, sort)
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 504, in unstack
    return _unstack_frame(obj, level, fill_value=fill_value, sort=sort)
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 529, in _unstack_frame
    unstacker = _Unstacker(
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 154, in __init__
    self._make_selectors()
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 210, in _make_selectors
    raise ValueError("Index contains duplicate entries, cannot reshape")
ValueError: Index contains duplicate entries, cannot reshape
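The example above can also be run as a standalone script. The sketch below rebuilds df2 with the same values, uses Index.duplicated to list the rows whose key occurs more than once, and then confirms that unstack() fails with the same message. It is meant only as a way to see the mechanism in isolation, not as a description of what qteasy does internally.

```python
import pandas as pd

# Rebuild df2 from the example above: the key (1, 3) appears twice.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (1, 3)], names=["a", "b"]
)
df2 = pd.DataFrame(
    {"c": [1, 2, 3, 4, 5, 6], "d": [9, 8, 7, 6, 5, 4]}, index=index
)

# keep=False flags every row that shares its key with another row.
dup_rows = df2[df2.index.duplicated(keep=False)]
print(dup_rows)

# unstack() needs exactly one value per (row, column) cell, so it refuses
# to reshape when a key is repeated.
try:
    df2.unstack()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Running the same duplicated(keep=False) check on htyp_series.index would show which keys are repeated in the real data.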

The key question, then, is why the data stored in the data source contains duplicate index entries.

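In qteasy, every data table is defined with a primary key, usually a date/time or a stock code. Normally, when data is written to the data source, rows with duplicate primary key values are removed automatically to ensure that no duplicates are stored. If you use MySQL as the data source, this is guaranteed. However, I noticed that you store data in csv files, so in theory duplicate values can exist in the index.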
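To make the idea of primary-key deduplication concrete, here is a minimal sketch of how new rows could be merged into an existing table while keeping one row per key. This is not qteasy's actual write path. The helper name merge_by_primary_key, the column names ts_code and end_date, and the sample values are all assumptions chosen for illustration.

```python
import pandas as pd


def merge_by_primary_key(existing, incoming, keys):
    """Hypothetical helper: combine two tables, keeping one row per primary key."""
    combined = pd.concat([existing, incoming], ignore_index=True)
    # keep="last" lets incoming rows overwrite existing rows with the same key.
    deduped = combined.drop_duplicates(subset=keys, keep="last")
    return deduped.reset_index(drop=True)


# Made-up sample data; the column names are assumptions, not qteasy's schema.
keys = ["ts_code", "end_date"]
existing = pd.DataFrame(
    {
        "ts_code": ["000001.SZ", "000002.SZ"],
        "end_date": ["20220930", "20220930"],
        "value": [1.0, 2.0],
    }
)
incoming = pd.DataFrame(
    {
        "ts_code": ["000002.SZ", "000004.SZ"],
        "end_date": ["20220930", "20220930"],
        "value": [2.5, 4.0],
    }
)

merged = merge_by_primary_key(existing, incoming, keys)
print(merged)
```

A database enforces this through the primary key constraint itself. With csv files, the check has to happen in code on every write, which is why a gap there could let duplicates through.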

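At this point I can't tell whether there is a bug in the function that writes the data tables or whether you added data manually at some point, so I need your help to resolve this:

1. I suggest temporarily changing line 4676 of database.py as follows: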

4676    try:
4677        new_df = htyp_series.unstack(level=0)
4678    except ValueError:  # raised when the index contains duplicate entries
4679        import pdb; pdb.set_trace()

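The code above starts the debugger when the line new_df = htyp_series.unstack(level=0) fails. Once the debugger has started, you can inspect the value of htyp_series. (If you are not familiar with pdb: in debug mode a (Pdb) prompt is shown; type a variable name to display its value, and type q to quit debug mode when you are done.)

(Pdb) htyp_series
# the value of htyp_series should be displayed here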

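2. Please tell me the value of htyp_series; it will show me which field of which data table is causing the problem.

3. Please open the problematic csv file in Excel and check whether it contains duplicate key values.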
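For a large file, the duplicate-key check in step 3 may be easier to script than to do in Excel. The following is a minimal sketch using only the standard library. The function name find_duplicate_keys, the key columns ts_code and end_date, and the in-memory sample are assumptions for illustration; the real key columns depend on which table turns out to be affected.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(csv_file, key_columns):
    """Hypothetical helper: return {key tuple: count} for keys seen more than once."""
    reader = csv.DictReader(csv_file)
    counts = Counter(tuple(row[col] for col in key_columns) for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample standing in for a data file; column names are assumptions.
sample = io.StringIO(
    "ts_code,end_date,value\n"
    "000001.SZ,20220930,1.0\n"
    "000002.SZ,20220930,2.0\n"
    "000001.SZ,20220930,1.5\n"
)
duplicates = find_duplicate_keys(sample, ["ts_code", "end_date"])
print(duplicates)
```

To run it against a real file, replace the StringIO object with open(path, newline="", encoding="utf-8"), where path points to the csv file in question.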

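With the steps above we may be able to determine the cause of the error. I'll keep investigating and hope to find the root cause soon. Thank you very much for your help!

OK, thank you! I added the encoding argument in database.py to work around the get_table_overview error; so far I haven't seen a similar problem in any other function.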

---- Original message ---- From: Jackie | Date: 2024-06-15 06:21 | Subject: Re: [shepherdpp/qteasy] Error when running the α stock-selection strategy (Issue #167)