testing-12-23 #46

Closed renzhengeek closed 8 years ago

renzhengeek commented 8 years ago

random R/W, sleep 3 ms between operations:

3774|   581   188   478   132   952   966   502    58   197   316 ...  859 write time:2843 msecs
3775|     9    32   539   160   134    22    33   708    90   585 ...   96 write time:678 msecs
3776|    82   754   248   440   854   749   710   838   829   969 ...   49 write time:453 msecs
3777|   359   549   868   970   803   900   437   328   885    91 ...  387 write time:445 msecs
3778|   788   177   889   352   867   834   755   759   153   361 ...  231 write time:3535 msecs
3779|   240   428   908   643   575   514   868   470   186   986 ...  709 write time:2304 msecs
3780|   368   733   260   792   108    34   708   623    67   807 ...  572 write time:268 msecs
3781|   254   567   637    67   961   909   978   192   785   574 ...  306 write time:373 msecs
3782|   768    83    90   326   862   621   918   144   828   392 ...  700 write time:6429 msecs
3783|   634   343   842   833   289   668   978   436   445   203 ...  999 write time:899 msecs
3784|   325   366   413   116   470   850   590   645   876   529 ...  832 write time:7103 msecs
3785|   288   213   698    14    29   576   570   717   312   385 ...  974 write time:267 msecs
3786|   645   850   621   966   865   210   881   451   941    15 ...  485 write time:232 msecs
3787|   597   214   738   722   329   349   795   475   779   638 ...  101 write time:3977 msecs
3788|   756   549   340   465    73   106   889   786   706   858 ...  950 write time:272 msecs
3789|   435   890   485   461   778   779    49   106   168   485 ...  293 write time:825 msecs
3790|   793   932   877   247   228   489   168   640   673   410 ...  707 write time:89 msecs
3791|    37   860   270   917   588   164    18   817   766   282 ...  857 write time:805 msecs
3792|   209   319   538   887   588   961   422   580   562   914 ...  388 write time:5195 msecs
3793|   567   461   871   934   608   780   120   869   705   129 ...  311 write time:5685 msecs
3794|   496   838   456   451   152   616   477   926   111   685 ...   25 write time:1056 msecs
3795|   141   706   610   740   315   982   302   918   739    16 ...  653 write time:3275 msecs
3796|   312   907   607   253   981   325   652   945   366   938 ...  649 write time:604 msecs
3797|   166   327   577     9   267    53   160   157   382   636 ...   79 write time:448 msecs
3798|   970   502   144   808   521   247   828   746   969   364 ...  807 write time:1148 msecs
3799|   933   588   909   410    26   400   818   618   393    95 ...  875 write time:1645 msecs

reading

 63|   912   863   852    73   845   991   773   856   462   469 ...  580 read time:23700 msec
  64|   691   514   656    55    79   120   580   655   133   756 ...  752 read time:3 msec
  65|   691   514   656    55    79   120   580   655   133   756 ...  391 read time:2 msec
  66|   691   514   656    55    79   120   580   655   133   756 ...  752 read time:3 msec
  67|    62    98   941   410   948    22   879   102   762   784 ...  845 read time:14400 msec
  68|    62    98   941   410   948    22   879   102   762   784 ...  584 read time:2 msec
  69|    62    98   941   410   948    22   879   102   762   784 ...  798 read time:2 msec
  70|   855   575    90   893   638   115   244   148   524   517 ...  802 read time:21854 msec
  71|   855   575    90   893   638   115   244   148   524   517 ...  303 read time:2 msec
  72|   995   100    17   909   102   108    92   635   274   193 ...  215 read time:34321 msec
  73|   995   100    17   909   102   108    92   635   274   193 ...  219 read time:2 msec
  74|   432   805   889    90   583   381    76    21   441   594 ...  457 read time:17721 msec
  75|   432   805   889    90   583   381    76    21   441   594 ...  457 read time:2 msec
  76|   932    74   829     3   523   296   316   195   612   110 ...  959 read time:37808 msec
  77|   932    74   829     3   523   296   316   195   612   110 ...  959 read time:2 msec
  78|   932    74   829     3   523   296   316   195   612   110 ...  196 read time:1 msec
  79|   514   743   729   826   934   972   929   160   134   419 ...  922 read time:12317 msec
  80|   514   743   729   826   934   972   929   160   134   419 ...  106 read time:2 msec
  81|   514   743   729   826   934   972   929   160   134   419 ...   26 read time:31693 msec
  82|     0   814    84   862   182   389   317   317   351    67 ...  170 read time:12 msec
  83|     0   814    84   862   182   389   317   317   351    67 ...  349 read time:2 msec
  84|     0   899   725   456   587    15   188   509   613   152 ...  314 read time:235941 msec
  85|     0   907   104   205    17   845   598   302    60   694 ...  314 read time:7 msec
  86|     0   907   104   205    17   845   598   302    60   694 ...  314 read time:2 msec
  87|     0   907   104   205    17   845   598   302    60   694 ...  314 read time:1 msec
  88|     0   500   704   658   469     4   152   156   444   345 ...  503 read time:148268 msec
  89|     0   500   704   658   469     4   152   156   444   345 ...  503 read time:2 msec
  90|     0   500   704   658   469     4   152   156   444   345 ...  503 read time:1 msec
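
(For reference, timings like the ones above come from repeatedly timing 1 MB reads/writes against a shared OCFS2 file. The sketch below is a hypothetical reconstruction of such a harness, not the actual test program: the file path, chunk count and output format are assumptions, while the 1 MB chunk size, random access and the short sleep between operations follow the descriptions in this thread.)

/* Hypothetical timing harness -- NOT the actual customer test program.
 * The file path, chunk count and output format are assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)        /* 1 MB per operation */
#define NUM_CHUNKS 1024                 /* assumed test file size: 1 GB */

static long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) * 1000 +
               (b->tv_nsec - a->tv_nsec) / 1000000;
}

int main(void)
{
        static int buf[CHUNK_SIZE / sizeof(int)];
        struct timespec start, end;
        size_t i;
        int fd, iter;

        fd = open("/mnt/ocfs2/testfile", O_RDWR | O_CREAT, 0644); /* assumed path */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (iter = 0; iter < 4000; iter++) {
                off_t off = (off_t)(rand() % NUM_CHUNKS) * CHUNK_SIZE;

                for (i = 0; i < CHUNK_SIZE / sizeof(int); i++)
                        buf[i] = rand() % 1000;         /* random values like those printed */

                clock_gettime(CLOCK_MONOTONIC, &start);
                if (pwrite(fd, buf, CHUNK_SIZE, off) != CHUNK_SIZE)
                        perror("pwrite");               /* the reader side would use pread() */
                clock_gettime(CLOCK_MONOTONIC, &end);

                printf("%d| ... write time:%ld msecs\n", iter, elapsed_ms(&start, &end));

                usleep(3000);                           /* sleep 3 ms between operations */
        }

        close(fd);
        return 0;
}
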
renzhengeek commented 8 years ago

down_read()

5665|    71   375   698   332   619   411   926   865   693   470 ...  838 write time:673 msecs
5666|   110   518   630   839   709   191   989    57   862   783 ...  381 write time:1038 msecs
5667|   380    69   706   464   941   185   303   379   191   400 ...  990 write time:254 msecs
5668|   589   713   780   828   961   710   521   464   361    46 ...   83 write time:149 msecs
5669|   170   576   760   614   187   578    94   423   680   602 ...  637 write time:317 msecs
5670|   438   463   784   774   801   598    13   738   162   209 ...  378 write time:1183 msecs
5671|   536   533   420   296    17   714   634   802   315   415 ...  552 write time:2175 msecs
5672|   352   124   120   594   812   561   432   181   790   148 ...  577 write time:17648 msecs
5673|   574   844   335   957   320   522   472    85   259   577 ...  673 write time:18455 msecs
5674|   183   239   157   777   700   342   196   227   159   440 ...  139 write time:16167 msecs
5675|    49   270   847   752   821   361   612   613   138   437 ...  713 write time:19659 msecs
5676|   722   179   695   632   868   255   120   376   749   147 ...  154 write time:16305 msecs
5677|   746    58   881   843   999   285   959   410   252   905 ...  566 write time:4785 msecs
5678|   172   703   320   279   862     4   301   234   960   446 ...  660 write time:163 msecs
5679|   511   651   586   369   719   178   658   933    16   785 ...  406 write time:20170 msecs
5680|   227   997   563   968   625   224   769   456   969   819 ...  121 write time:7325 msecs
5681|   434   362    74   589   405   119   144   819    79   333 ...  349 write time:3016 msecs
5682|   543   607   609   438   270   511   919   581    99   848 ...  278 write time:20225 msecs
5683|   319   463   742    27   382   399   163   358   177   971 ...  265 write time:11125 msecs
5684|    55   273   509   281   550   510   839   430   498   121 ...  345 write time:15071 msecs
5685|   746   121   737   321   122   813   144     5   426   697 ...  313 write time:21127 msecs
5686|   862   996   157   550   393   904     9    65   148   319 ...  682 write time:18823 msecs
5687|   224   136   489   389   366   931     9   201   146    85 ...   55 write time:18297 msecs
5688|   113   461   933   732    38   113   301   110   760   655 ...   42 write time:21366 msecs
5689|   539   198   953   758   742   958   826   343   688   442 ...  255 write time:13590 msecs
5690|   846   383   649   668   587   567   569   369   180   693 ...  536 write time:16419 msecs
5691|   904   832   464   268   742   799   606   883   168   379 ...  527 write time:21237 msecs

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  80|     0   130   378   313   140   720   505   839   183   279 ...  238 read time:14618 msec
  81|   561   591   721   117   961   409   456   819   961    57 ...  220 read time:24634 msec
  82|   295   802    64   127   147   902   164   386   113   384 ...  104 read time:14043 msec
  83|     0   313   196   163   186   768   150   868   733   828 ...  121 read time:13797 msec
  84|    10   939   820   855   846   744   332   407   145   916 ...  122 read time:16537 msec
  85|   218   514   717   148   318   387   995    83   378   226 ...  122 read time:12905 msec
  86|     0   623   465   442   317   631   564   303   345   457 ...  122 read time:12005 msec
  87|     0   924   229   113     1   919   877   148   980    17 ...  122 read time:15248 msec
  88|     0   244   875   482   525   814   731   617   939   298 ...  122 read time:23633 msec
  89|   669   670   103   498   922   265   210   734     6   955 ...  368 read time:19300 msec
  90|     0   979   123   563   692   815   578   232    45     5 ...  552 read time:40553 msec
  91|   352   124   120   594   812   561   352   124   120   594 ...  938 read time:18201 msec
  92|   574   844   335   957   320   522   472    85   259   577 ...  552 read time:12540 msec
  93|     0   460   908   412   421    58   669   830    77   605 ...  266 read time:21030 msec
  94|    49   270   847   752   821   361   612   613   138   437 ...  779 read time:17169 msec
  95|     0   612   859   922   626    46   903   909    64   936 ...  673 read time:13692 msec
  96|     0   345   492   250   794   214   901    78   582   307 ...  761 read time:25486 msec
  97|     0   345   492   250   794   214   901    78   582   307 ...  681 read time:24884 msec
  98|   543   607   609   438   270   511   919   581    99   848 ...  904 read time:31177 msec
  99|    55   273   509   281   550   510   839   430   498   121 ...  226 read time:16441 msec
 100|   746   121   737   321   122   813   144     5   426   697 ...  325 read time:14397 msec
 101|   862   996   157   550   393   904     9    65   148   319 ...  749 read time:15666 msec
 102|   224   136   489   389   366   931     9   201   146    85 ...  313 read time:16936 msec
 103|   113   461   933   732    38   113   301   110   760   655 ...  313 read time:13892 msec
 104|   113   461   933   732    38   113   301   110   760   655 ...  173 read time:23281 msec
renzhengeek commented 8 years ago
vfs_read()
 ocfs2_file_aio_read()
   ocfs2_inode_lock_atime
   ocfs2_inode_unlock()
     generic_file_aio_read()
       do_generic_file_read()
          page_cache_sync_readahead() or page_cache_async_readahead()
             ocfs2_readpages()
                 ocfs2_inode_lock_full(NOBLOCK)
                 down_read_trylock(ip_alloc_sem)
              ocfs2_readpage() /* retried until it succeeds */
                 ocfs2_inode_lock_with_page()
                 down_read_trylock(ip_alloc_sem)
                 ocfs2_inode_unlock
             PAGE OK: copy page to user space 
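
The "retried until it succeeds" step above is what makes read() sensitive to contention: ocfs2_readpage() takes both the cluster lock and ip_alloc_sem without blocking and, if either is busy, returns AOP_TRUNCATED_PAGE so the VFS drops the page lock and calls readpage again. A heavily simplified sketch of that pattern (prototypes trimmed and error paths collapsed; see fs/ocfs2/aops.c for the real code):

/* Simplified sketch of the non-blocking lock / retry pattern in
 * ocfs2_readpage(); not the exact kernel source. */
static int ocfs2_readpage_sketch(struct file *file, struct page *page)
{
        struct inode *inode = page->mapping->host;
        struct ocfs2_inode_info *oi = OCFS2_I(inode);
        int ret;

        /* Non-blocking PR cluster lock, taken while holding the page lock;
         * on contention the helper unlocks the page itself and we return
         * AOP_TRUNCATED_PAGE so do_generic_file_read() re-finds the page
         * and calls readpage again. */
        ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);
        if (ret != 0)
                return ret;

        /* Same idea for the local allocation semaphore. */
        if (!down_read_trylock(&oi->ip_alloc_sem)) {
                ocfs2_inode_unlock(inode, 0);
                unlock_page(page);
                return AOP_TRUNCATED_PAGE;
        }

        /* Actually read the block(s) into the page cache. */
        ret = block_read_full_page(page, ocfs2_get_block);

        up_read(&oi->ip_alloc_sem);
        ocfs2_inode_unlock(inode, 0);
        return ret;
}
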
renzhengeek commented 8 years ago
vfs_write()
   ocfs2_file_aio_write()
     ocfs2_rw_lock(EX)
      generic_buffered_write()
       generic_perform_write()
        ocfs2_write_begin()
             ocfs2_inode_lock_full()
              down_write(ip_alloc_sem)
         copy from user space
         ocfs2_write_end()
              up_write(ip_alloc_sem)
              ocfs2_inode_unlock
     ocfs2_rw_unlock()
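
By contrast, the write side takes its locks blockingly and holds the EX rw_lock across the whole buffered write, so a writer never has to retry once it gets scheduled. A simplified sketch of the per-page begin/end step (again not the exact kernel source, which takes more parameters and handles allocation):

/* Simplified sketch of the blocking write-side locking order;
 * the real ocfs2_write_begin()/ocfs2_write_end() take more arguments. */
static int ocfs2_write_begin_sketch(struct inode *inode)
{
        struct ocfs2_inode_info *oi = OCFS2_I(inode);
        int ret;

        /* Blocking EX cluster lock on the inode. */
        ret = ocfs2_inode_lock(inode, NULL, 1);
        if (ret)
                return ret;

        /* Blocking writer side of the allocation semaphore -- unlike the
         * reader's down_read_trylock(), this simply waits. */
        down_write(&oi->ip_alloc_sem);

        /* ... prepare the page; the caller then copies from user space ... */
        return 0;
}

static void ocfs2_write_end_sketch(struct inode *inode)
{
        struct ocfs2_inode_info *oi = OCFS2_I(inode);

        up_write(&oi->ip_alloc_sem);
        ocfs2_inode_unlock(inode, 1);
}
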
renzhengeek commented 8 years ago

3583 static int ocfs2_data_convert_worker(struct ocfs2_lock_res *lockres,
3584                                      int blocking)
3585 {
 ...
3616         sync_mapping_buffers(mapping);
3617         if (blocking == DLM_LOCK_EX) {
3618                 truncate_inode_pages(mapping, 0);
3619         } else {
  ....
}

renzhengeek commented 8 years ago

<zren> Hello Goldwyn, just sharing some new thoughts on the L3 bug here, hah
<zren> 1. pasting the R/W paths here so they're convenient for us to talk about
<zren> vfs_read()
<zren>  ocfs2_file_aio_read()
<zren>    ocfs2_inode_lock_atime
<zren>    ocfs2_inode_unlock()
<zren>      generic_file_aio_read()
<zren>        do_generic_file_read()
<zren>           page_cache_sync_readahead() or page_cache_async_readahead()
<zren>              ocfs2_readpages()
<zren>                  ocfs2_inode_lock_full(NOBLOCK)
<zren>                  down_read_trylock(ip_alloc_sem)
<zren>              ocfs2_readpage() /* retried until it succeeds */
<zren>                  ocfs2_inode_lock_with_page()
<zren>                  down_read_trylock(ip_alloc_sem)
<zren>                  ocfs2_inode_unlock
<zren>              PAGE OK: copy page to user space 
<zren> vfs_write()
<zren>    ocfs2_file_aio_write()
<zren>      ocfs2_rw_lock(EX)
<zren>       generic_buffered_write()
<zren>        generic_perform_write()
<zren>         ocfs2_write_begin()
<zren>              ocfs2_inode_lock_full()
<zren>              down_write(ip_alloc_sem)
<zren>          copy from user space
<zren>          ocfs2_write_end()
<zren>               up_write(ip_alloc_sem)
<zren>               ocfs2_inode_unlock
<zren>      ocfs2_rw_unlock()
<zren> 2. my new thoughts on this bug
<zren> For read(), the cluster lock is only held while doing the readpage() work, i.e. bringing data from disk into the page cache.
<zren> In the customer's case, contention on the cluster lock for the same file is rather fierce, so it's likely that an EX lock request from a remote node has already been queued.
<zren> When readpage() returns, the cluster lock has been released, which kicks the downconvert thread into action. ocfs2dc will then truncate all cached pages:
<zren> 3583 static int ocfs2_data_convert_worker(struct ocfs2_lock_res *lockres,
<zren> 3584                                      int blocking)
<zren> 3585 {
<zren>  ...
<zren> 3616         sync_mapping_buffers(mapping);
<zren> 3617         if (blocking == DLM_LOCK_EX) {
<zren> 3618                 truncate_inode_pages(mapping, 0);
<zren> 3619         } else {
<zren>   ....
<zren> }
<zren> According to the test output, it looks like the fiercer the contention is, the more unfairly read() is treated...
renzhengeek commented 8 years ago
<zren> I changed the customer's test program to do random R/W and to sleep for 100 ms after each 1 MB of R/W.
<zren> I saw write() take about 5 s, while read() may take >20 s, then 2 ms, then >20 s again, repeating like that...
<zren> write():
<zren> 789|   435   890   485   461   778   779    49   106   168   485 ...  293 write time:825 msecs
<zren> 3790|   793   932   877   247   228   489   168   640   673   410 ...  707 write time:89 msecs
<zren> 3791|    37   860   270   917   588   164    18   817   766   282 ...  857 write time:805 msecs
<zren> 3792|   209   319   538   887   588   961   422   580   562   914 ...  388 write time:5195 msecs
<zren> 3793|   567   461   871   934   608   780   120   869   705   129 ...  311 write time:5685 msecs
<zren> 3794|   496   838   456   451   152   616   477   926   111   685 ...   25 write time:1056 msecs
<zren> 3795|   141   706   610   740   315   982   302   918   739    16 ...  653 write time:3275 msecs
<zren> 3796|   312   907   607   253   981   325   652   945   366   938 ...  649 write time:604 msecs
<zren> 3797|   166   327   577     9   267    53   160   157   382   636 ...   79 write time:448 msecs
<zren> 3798|   970   502   144   808   521   247   828   746   969   364 ...  807 write time:1148 msecs
<zren> 3799|   933   588   909   410    26   400   818   618   393    95 ...  875 write time:1645 msecs
<zren> read():
<zren> 63|   912   863   852    73   845   991   773   856   462   469 ...  580 read time:23700 msec
<zren>   64|   691   514   656    55    79   120   580   655   133   756 ...  752 read time:3 msec
<zren>   65|   691   514   656    55    79   120   580   655   133   756 ...  391 read time:2 msec
<zren>   66|   691   514   656    55    79   120   580   655   133   756 ...  752 read time:3 msec
<zren>   67|    62    98   941   410   948    22   879   102   762   784 ...  845 read time:14400 msec
<zren>   68|    62    98   941   410   948    22   879   102   762   784 ...  584 read time:2 msec
<zren>   69|    62    98   941   410   948    22   879   102   762   784 ...  798 read time:2 msec
<zren>   70|   855   575    90   893   638   115   244   148   524   517 ...  802 read time:21854 msec
<zren>   71|   855   575    90   893   638   115   244   148   524   517 ...  303 read time:2 msec
<zren>   72|   995   100    17   909   102   108    92   635   274   193 ...  215 read time:34321 msec
<zren>   73|   995   100    17   909   102   108    92   635   274   193 ...  219 read time:2 msec
<zren>   74|   432   805   889    90   583   381    76    21   441   594 ...  457 read time:17721 msec
<zren>   75|   432   805   889    90   583   381    76    21   441   594 ...  457 read time:2 msec
<zren>   76|   932    74   829     3   523   296   316   195   612   110 ...  959 read time:37808 msec
<zren>   77|   932    74   829     3   523   296   316   195   612   110 ...  959 read time:2 msec
<zren> 3. does it sound reasonable? looking forward to your opinion, hah
renzhengeek commented 8 years ago
1766         /*
1767          * We're fine letting folks race truncates and extending
1768          * writes with read across the cluster, just like they can
1769          * locally. Hence no rw_lock during read.
1770          * 
1771          * Take and drop the meta data lock to update inode fields
1772          * like i_size. This allows the checks down below
1773          * generic_file_aio_read() a chance of actually working. 
1774          */
1775         ret = ocfs2_meta_lock_atime(inode, filp->f_vfsmnt, &lock_level);
1776         if (ret < 0) {
1777                 mlog_errno(ret);
1778                 goto bail;
1779         }